It seems as if the fundamentals of how we produce vowels and how 
they are acoustically represented have been clarified: we phonate and 
articulate. Using our vocal cords, we produce a vocal sound or noise 
which is then shaped into a specific vowel sound by the resonances 
of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Ac- 
cordingly, the acoustic description of vowels relates to vowel-specific 
patterns of relative energy maxima in the sound spectra, known as 
patterns of formants. 


The intellectual and empirical reasoning presented in this treatise, 
however, gives rise to scepticism with respect to this understanding of 
the sound of the vowel. The reflections and materials presented pro- 
vide reason to argue that, up to now, a comprehensible theory of the 
acoustics of the voice and of voiced speech sounds is lacking, and 
consequently, no satisfying understanding of vowels as an achieve- 
ment and particular formal accomplishment of the voice exists. Thus, 
the question of the acoustics of the vowel—and with it the question of 
the acoustics of the voice itself—proves to be an unresolved funda- 
mental problem. 


Introduction 


Topic and Aims 


The vocal cords—when oscillating and modulating air expelled from 
the lungs— produce a sound (a source sound), which is transformed 
by the resonances of the pharyngeal, oral and nasal cavities: depend- 
ing on the position of the larynx, velum, tongue, lips and jaw, different 
shapes of these cavities are formed thus creating different resonance 
characteristics, allowing different vocal sounds (phones) to be pro- 
duced and perceived accordingly. If a vocal sound is perceived to be- 
long to a particular linguistic unit (more precisely, a basic linguistic unit, 
a phoneme), and if the cavity formed by the pharynx and the mouth re- 
mains open, then the sound produced is referred to as a vowel sound 
and its linguistic identity as a vowel quality or simply as a vowel. 


The prevailing theory of vowel acoustics begins with such formulations, 
or similar ones. According to this theory, with respect to human utter- 
ances, the vocal cords produce a general sound, which is transformed 
into a specific vowel sound by the resonances of the (supralaryngeal) 
vocal tract: as human beings, we phonate and articulate. 


Because of this, vowel sounds, as sounds, are expected to exhibit rel- 
ative spectral energy maxima in those frequency ranges that corre- 
spond to the resonances of the vocal tract during speech production. 
These spectral energy maxima are known as formants. 


Such a perspective gives rise to the prevailing psychophysical princi- 
ple of the vowel: vowel sounds that are perceived as having the same 
vowel quality have similar formant patterns, that is, similarly patterned 
relative spectral energy maxima. By contrast, vowel sounds that are 
perceived as different vowel qualities have dissimilar formant patterns. 


At first glance, such a conception of vowel production and of the sub- 
sequent physical representation of vowels seems plausible or even 
self-evident. Our vocal cords do vibrate when we speak, we do move 
our mouths (more precisely, our articulators) to form different vocal 
sounds, and we are indeed often able to “lip read” the words uttered 
from such movements, an ability highly developed by deaf people. 


Moreover, the vast majority of statistical investigations seem to confirm 
the correlation between vowels and vowel-specific formant patterns. 


Vowel synthesis, in which artificial source sounds are transformed by 
filters, has also proven very capable of producing recognisable vowel sounds. 


From such a perspective, existing problems in analysing and determining 
the physical characteristics of vowel sounds according to the perceived 
vowel quality are not considered to bear on the principle of the 
prevailing theory; instead, they are related to the dynamics and complexity 
of the production and perception of speech. Furthermore, isolated vow- 
el sounds, for which a simple and statistical correspondence between 
the perceived vowel quality and its specific formant pattern is to be 
expected, are often considered as playing only a marginal role in every- 
day speech. In speech, vowel sounds and perceived vowel qualities are 
generally embedded in syntactic and semantic contexts, in contexts 
of other vocal sounds and of meaning. Such embedded vowel sounds 
exhibit distinct dynamic processes and above all transitions from one 
sound to another. Thus, vowel sounds may be perceived in speech even 
if distinct, static sound elements are absent, and a vowel sound isolat- 
ed from speech as a sound fragment may be perceived as a different 
vowel quality than the same sound in connected speech. This explains, 
for example, why speech can remain intelligible even when substantial 
interferences or transformations affect its transmission. And so on. 


Consequently, the current scientific discussions mainly focus on spe- 
cific matters such as different types of phonation and articulation when 
producing vowel sounds, sound variations and dynamic processes re- 
lated to the respective syntactic and semantic context, sounds pro- 
duced by speakers of different age and gender and corresponding nor- 
malisation attempts, attempts to improve formant pattern estimation 
and attempts to relate acoustic findings and processes of auditory 
perception. And so on. 


That said, the present consideration returns to 
the basic assertion of the current acoustic theory of the vowel cited at 
the beginning of this introduction. It presents a critical reading, indeed 
a falsification, of this assertion. Further, it seeks to demonstrate that 
whereas prevailing theory indicates (is an index of) the actual physi- 
cal characteristics of vowels, it fails to designate these characteristics 
adequately. As such, this work highlights an unresolved fundamental 
problem of the voiced speech sound, and thus of the voice as such, 
and raises this problem once again for discussion. 


The form of this treatise is, in part, unusual in a scientific context. How- 
ever, with the exception of the four aspects discussed below, this in- 
troduction dispenses with lengthy prefatory explanations. In its course, 
the argument and its form of presentation should become self-evident. 
Besides, additional comments in the afterword further expand on, and 
hopefully clarify, matters. 


As mentioned, however, four introductory aspects are to be explained at 
this juncture. They concern linguistic expression and style, referencing, 
the significance of argumentation and the perspective adopted here. 


Many parts of the main body of the text are “abstract” in their pres- 
entation, which is to say, they are “technical”. This might complicate 
the reading. Moreover, with the exception of Sections 1.10, 2.1 and 
2.2, the text is not accompanied by illustrated examples or tables listing 
statistical data. Further, from Part III onwards, the text requires the 
reader to reflect thoroughly on the prevailing theory of the vowel as 
presented in Part I. The text also calls upon the reader to approach 
the related terms and concepts and the statistical values for formant 
patterns with a certain amount of self-assurance. However, such a pro- 
cedure is necessary: the text insists on the discussion of a few fun- 
damental reflections and general facts, and their interrelations, in the 
attempt, as mentioned, to highlight a fundamental problem. 


Most of the issues considered here have already been discussed in the 
literature, and most of the corresponding publications were presented 
by other authors. However, they have often been interpreted in a way 
that differs from the point of view taken here. Yet, aside from the illus- 
trations and tables mentioned, the text largely dispenses with explicit 
references to previous studies, including our own, so as to pursue its 
main argument without any detailed discussion and referencing of in- 
dividual aspects. The Materials section (for the structure of this text, 
see below), however, includes a considerable number of citations, to- 
gether with references to existent publications. Moreover, as mentioned 
above, my colleagues and I have discussed most of the aspects addressed 
here elsewhere. The present text is new in its course of argument, 
as is the arrangement and presentation of citations, comments, 
illustrated examples and outlines of experiments in the Materials and 
Experiments sections. New content, however, concerns only aspects 
discussed in Part V and in the afterword, some presentations in the 
Materials section (see Sections M8.2, M10-A) and some examples in 
the Experiments section. 


The empirical basis of this treatise, to which many of the statements 
made here refer, above all in Part III and IV, consists of recordings from 
various areas of everyday life, the entertainment sector and art, that is, 
stage voices in music and straight theatre. Whereas one part of these 
recordings forms the basis of single, published investigations under- 
taken in the past, another part is unpublished and the corresponding 
recordings have not been subject to any further identification tests, 
apart from the identification by the author. Thus, the reflections in Parts 
III and IV lay no claim to consistent verification in terms of the existing 
scientific standards. Instead, they are formulated as hypotheses in 
view of general findings that are conceivable or even predictable. In 
line with this, illustrated examples are given in the Materials section. 


Accordingly, this treatise is limited to presenting and interrelating those 
reflections, experiences and observations anew that tend to refute the 
assertion that vowel qualities are physically represented by formant 
patterns. If this undertaking proves successful, then—to repeat and 
insist—this once again raises the question of the voiced speech sound 
as a fundamental problem. 


The argument focuses on and is limited to the relationship between 
individual vowel sounds, perceived vowel qualities, corresponding sound 
spectra and formant patterns in the sense of patterns of formant fre- 
quencies. Formant bandwidths and amplitudes, to mention two as- 
pects of possible importance, are not discussed in detail. 


This treatise adopts a decidedly psychophysical perspective. Only gen- 
eral reference is made to the production and perception of sounds: sound 
production is referred to because the concept of formants itself refers 
to vocal tract resonances and also because this relationship needs to 
be emphasised repeatedly in the course of the argument. Sound per- 
ception is referred to because the reflections presuppose that the vowel 
sounds discussed can be attributed to (perceptually identified as) the 
specific vowel qualities in question. Beyond these general references, 
however, production and perception are not further discussed. 


By no means does excluding a consideration of further details of sound 
production and perception from the present discussion suggest that 
these aspects are unimportant for the physical description of vowels. 
Doing so merely serves to focus on the psychophysical question of the 
vowel: given that an utterance—or its reproduction, manipulated or not, 
or a synthesis for that matter—is perceived as a specific vowel quality, 
which describable physical characteristic or which ensemble of physi- 
cal characteristics may be said to represent that quality? 


In line with this, the argument focuses on voiced oral vowel sounds 
produced either in isolation or isolated (extracted) from syntactic and 
semantic contexts. Thus, nasalisation and the syntactic and seman- 
tic context are as such also excluded from discussion. With regard to 
the different types of phonation, only whispered vowels are considered 
here, and are mentioned only briefly. Again, this is intended to enable the 
straightforward discussion of the psychophysical question of the vowel. 


In no way does limiting the consideration to voiced vowel sounds isolated 
from syntactic and semantic contexts and exhibiting quasi-static 
spectral characteristics suggest that such static spectral characteris- 
tics are absolutely necessary for vowel recognition. Thus, the limitation 
adopted here does not run counter to the phenomena described in the 
literature concerning the possibility of vowel recognition in the case 
of sounds exhibiting predominantly dynamic spectral characteristics. 
This study does, however, refute the conclusion partly drawn in the lit- 
erature that isolated vowel sounds or sound fragments with quasi-stat- 
ic spectral characteristics are essentially less easily recognisable than 
vowel sounds occurring in a syntactic context and associated with dis- 
tinctively dynamic spectral characteristics and transitions, or that the 
former are even insufficiently recognisable. The afterword will return to 
this aspect. 


As this treatise reveals, there is good reason to understand and pur- 
sue the psychophysics of voiced speech sounds as a phenomeno- 
logy: that is, for research not to start from a model and to conduct 
single experiments based on it, but instead from an open-ended and 
continually expanding collection and compilation of vocal utterances, 
together with a simultaneously evolving description of their physical 
characteristics related to perceived vowel qualities. 


With the adoption of such a perspective, it may become understand- 
able why the present treatise, despite its narrow focus on phonetics, 
is not published by a correspondingly specialised university institute, 
but rather by an institute affiliated with an arts university. In contrast 
to many approaches, here there is no assumption of a “normal case” 
of speaking, based on which “other kinds” of utterances are treated 
as “special cases”, such as emotionally tinged utterances with cor- 
responding variations of fundamental frequency and vocal effort, or 
utterances produced with a “head voice”, or shouting, or singing, or 
acting, and so on. Such a view is not borne out either by everyday ex- 
perience or by creative expression. 


In the first instance, vocal utterances and thus speech sounds do not 
obey narrowly restricted norms of production, and the only reliable rep- 
resentation of the human voice and speech that critical reflection and 
the development of an empirical approach can refer to, is the artistic 
or interpretative utterance. Only art is able to represent the “artificiali- 
ty” —that is, the reduction, standardisation and coding —of any specific 
utterance whilst, at the same time, overcoming it, albeit only to some 
extent. Referring to the fact that any utterance is a token, not a type, 
only art involves the quasi-systematic variation of vocal utterances, 
without which any investigation and consideration of the relationship 
between the sounds produced and the qualities perceived run the risk 
of interpreting findings about concrete and specific utterances as find- 
ings about general characteristics and principles. The afterword will 
return to this point, too. 


Vowel sounds, perceived as isolated single sounds, can be intelligible. 
This fact is central to human voice and speech: vowel sounds must be 
intelligible as such because elementarisation—manifest in the aptitude 
of speech for a phonetic system of writing—is at the core of speech 
and language. Such an assumption underlies the reflections advanced 
here. Consequently, vowel qualities—or rather the differences between 
the vowel qualities of any given language —are considered to be repre- 
sented physically. As this treatise aims to show, it is likely that such a 
representation cannot be derived from a physical model but, instead, 
needs to be described as an achievement of the human voice itself. 


Structure 


This treatise is divided into a main body and the two sections Materials 
and Experiments. 


The main body is divided into five parts, followed by an afterword: 


- Part I reviews the prevailing theory of the physical characteristics 
involved in vowel representation. 

- Part II presents reflections that, according to the author's reading 
of the literature, oppose the understanding of the theory, that 
is, its intellectual re-enactment and validation. 

- Part III formulates several hypotheses about the actual relationship 
between vowel sounds, sound spectra and formant patterns. 
These hypotheses refer to the recordings mentioned in 
the introduction and to related analyses and observations. 

- Part IV explains why the reflections, experiences and observations 
compiled here falsify prevailing theory. 

- Part V discusses the resulting state of affairs and points to the 
need to devise a phenomenology and to develop a new theory. 
This part also includes an excursus on the harmonic spectrum 
as being vowel specific. 

- The afterword presents various additional comments. 


The Materials section contains selected excerpts from the literature, 
commented on in part, and presents exemplary series of vowel sounds 
and related acoustic analysis. An extended version of the materials is 
also presented in digital form online; please refer to: 

http://www.phones-and-phonemes.org/vowels/acoustics/preliminaries 


The treatise concludes with a list of possible experiments that allow for 
empirical exploration of the problems discussed here under laboratory 
conditions. 


The main body of this text—excluding Section 13.3 which was added 
to this edition separately —is a revised and translated version of an earlier 
publication in German titled Akustik des Vokals – Präliminarien (Maurer, 
2013). The Materials section is an entirely revised and substantially en- 
larged version of the digitally published sound archive of the German 
version. The Experiments section is new. 


Tables and figures are numbered separately for each chapter. In the 
Materials section, the figure legends are positioned at the top. 


The citations in the Materials section are given in their original version, 
including the corresponding writing style and format. 


If included in the citations of the Materials section, figures referred to 
are not given in this treatise and publications referred to are not listed 
in the References section. For corresponding details, please consult 
the publications in question. 


Terms and Notation 


To facilitate reading, the key terms, notation style and abbreviations 
adopted in the text are explained below. 


Vocal tract. The term “vocal tract” is used as a short form referring to 
the supralaryngeal (or supraglottal) vocal tract in terms of the pharyn- 
geal, oral and nasal cavities. 


Sound, vocal sound, speech sound. The distinction between “sound” 
(Klang, a quasi-periodic sound with a pitch and a harmonic spectrum) 
and “noise” (Geräusch, a non-periodic sound with no pitch) is made 
in the English version of this treatise only when it matters for the ar- 
gument. In all other cases, the term sound is used as a generic term. 


The distinction between “vocal sound” (Laut, voiced or unvoiced) and 
“speech sound” (Sprachlaut) is made here to refer to the fact that not 
every vocal utterance is linguistic in a narrow sense, that is, not every 
vocal utterance can be attributed to a phoneme. 


Vowel sound, vowel quality. The term “vowel sound” refers to a single 
concrete vocal sound possessing linguistic value, that is, a phone. It 
is termed a vowel sound—in distinction from other phones—because 
it is perceived to have vowel quality (see below). According to the lit- 
erature, vowel sounds are quoted in square brackets, for instance [a]. 
In part, additional suprasegmental characteristics are also given, for 
instance, in the distinction between [a:] in the German word Kahn and 
[a] as in Kamm (long and short vowel sound). 


The term “vowel quality” denotes a class of vowel sounds of an individ- 
ual language, that is, a phoneme. Thus, concrete single vowel sounds 
as phones are attributed to abstract classes of vowel qualities as pho- 
nemes. In the literature, vowel qualities are quoted between two slash- 
es, such as /a/. 


Vowel qualities are quoted according to the symbols of the Internation- 
al Phonetic Alphabet (revised to 2005). 


Whenever context allows, the terminological distinction between vow- 
el sounds and vowel qualities is shortened to the distinction between 
vowel sounds and vowels, or sounds and vowels. 


In general, the reflections, experiences and observations presented in 
Part II refer to the long vowels of Standard German /i, y, e, ø, ɛ, a, o, 
u/. Included here is the vowel /ɑ/, which is encountered in the Swiss 
pronunciation of Standard German. Therefore, the corresponding vowel 
area is assigned as /a–ɑ/, including all allophones of /a/ or /ɑ/. In the 
Materials section, some sounds of the vowel /ɔ/ are also included in order 
to discuss the spectral phenomena occurring between /a–ɑ/ and /o/. 


In the text, these vowels are often subsumed under three groups: as 
front vowels /i, y, e, ø, ɛ/, as vowel area /a–ɑ/ and as back vowels /ɔ, 
o, u/. The terms “front vowels” and “back vowels” are adopted from 
the literature, but they have no further significance here. In particular, 
their attributed relationship with the tongue position in sound produc- 
tion plays no part. 


Note that, depending on the subject of discussion or demonstration, 
the vowel order sometimes deviates from a consistent front-back di- 
rection. 


The discussion focuses on German vowels because most of the au- 
thor’s experiences and observations to date concern the sounds of the 
German language. However, the corresponding general statements also 
apply to other individual languages. 


Fundamental frequency. The term “fundamental frequency” refers to 
the measured fundamental frequency of the sound. However, no dis- 
tinction is made in the text between fundamental frequency and pitch, 
because such a differentiation is insignificant to the discussion. Thus, 
both terms are used synonymously. 


Here, F0 is used as an abbreviation for fundamental frequency. Depending 
on the context, the abbreviation refers to fundamental 
frequency in general terms or to a specific level (or range) of funda- 
mental frequency in Hz. 


Spectrum, harmonic spectrum. The term “spectrum” refers to the 
sound spectrum of a vowel sound, generally resulting from a Fourier 
analysis. In certain cases, the term can refer to a spectrogram because, 
in many empirical studies, formant values are appraised or verified on 
the basis of this type of spectrum. Important differences exist between 
these two types of spectral representation. However, because the pres- 
ent consideration concerns only general aspects, with a few excep- 
tions, these differences are negligible here. In the exceptional cases 
referred to, corresponding differentiations will be made. 
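

As a minimal sketch of what a Fourier analysis yields for a quasi-periodic 
sound, the following Python example synthesises a sound from three sinusoidal 
partials and computes its amplitude spectrum with a naive discrete Fourier 
transform. The fundamental frequency of 200 Hz, the partial amplitudes, the 
sampling rate and all function names are assumptions chosen for illustration 
only; they are not taken from the recordings discussed in this treatise. 

```python
import cmath
import math

def synth_harmonic_sound(f0, amplitudes, sample_rate, n_samples):
    """Sum of sinusoidal partials at integer multiples of f0 (hypothetical helper)."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0 * n / sample_rate)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]

def amplitude_spectrum(signal, sample_rate):
    """Naive DFT; returns (frequency in Hz, amplitude) pairs up to the Nyquist frequency."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * m / n)
                    for m, x in enumerate(signal))
        spectrum.append((k * sample_rate / n, 2 * abs(coeff) / n))
    return spectrum

SAMPLE_RATE = 8000  # Hz, assumed
# 400 samples at 8 kHz hold exactly ten periods of a 200 Hz sound.
signal = synth_harmonic_sound(200.0, [1.0, 0.5, 0.25], SAMPLE_RATE, 400)
spectrum = amplitude_spectrum(signal, SAMPLE_RATE)

# Keep only the bins that carry noticeable energy.
partials = [(f, round(a, 3)) for f, a in spectrum if a > 0.01]
print(partials)
```

Because the analysis window holds a whole number of periods, the listing 
shows three partials at 200, 400 and 600 Hz with amplitudes of approximately 
1, 0.5 and 0.25. For natural vowel sounds, windowing and quasi-periodicity 
blur such a clean picture. 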


The term “harmonic spectrum” refers to a series of harmonics in the 
sound spectrum, a series of partials (sinusoidal components of a complex 
tone) whose frequencies are integer multiples of the fundamental 
frequency. However, even if this terminology is common, it is not 
unquestionable. Above all, vowel spectra may not always exhibit the 
first (or the first few lower) harmonics (consider, for example, high-pass 
filtering), and the perceived pitch may not always correspond to the 
acoustically measured fundamental frequency. The emerging termino- 
logical question is left open here. 
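

The relationship between harmonics and fundamental frequency described 
above can be sketched as follows. The example lists the harmonics of an 
assumed fundamental frequency, removes the lower ones by an idealised 
high-pass filter and then checks, by means of a crude autocorrelation, that 
the remaining waveform still repeats at the period of the absent fundamental. 
All values and function names are illustrative assumptions. The sketch 
concerns the periodicity of the waveform only; it makes no claim about 
perceived pitch. 

```python
import math

def harmonic_frequencies(f0, max_frequency):
    """Harmonics (integer multiples of f0) up to max_frequency."""
    count = int(max_frequency // f0)
    return [k * f0 for k in range(1, count + 1)]

def high_pass(harmonics, cutoff):
    """Idealised high-pass filtering: drop every harmonic below the cutoff."""
    return [f for f in harmonics if f >= cutoff]

def waveform(frequencies, sample_rate, n_samples):
    """Sum of equal-amplitude sinusoids at the given frequencies."""
    return [sum(math.sin(2 * math.pi * f * n / sample_rate) for f in frequencies)
            for n in range(n_samples)]

def period_in_samples(signal, min_lag, max_lag):
    """Lag with the largest autocorrelation, a crude estimate of the period."""
    def autocorr(lag):
        return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

SAMPLE_RATE = 8000  # Hz, assumed
harmonics = harmonic_frequencies(200.0, 1000.0)
remaining = high_pass(harmonics, 500.0)  # the 200 and 400 Hz harmonics are removed

signal = waveform(remaining, SAMPLE_RATE, 800)
period = period_in_samples(signal, 20, 60)
print(remaining, SAMPLE_RATE / period)
```

In this constructed case, the spectrum contains no energy at 200 Hz, yet the 
waveform repeats every 40 samples, which corresponds to 200 Hz at the assumed 
sampling rate. 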


Relative spectral energy maximum, spectral envelope peaks. The 
term “relative spectral energy maximum” refers to a narrowly delimit- 
ed frequency range of a spectrum that exhibits significantly increased 
energy compared to the frequency ranges immediately preceding and 
immediately following such spectral enhancement. In the literature, such 
relative maxima are in general determined on the basis of evaluating 
a spectral envelope (in the sense of an imaginary smooth line drawn 
to enclose an amplitude spectrum, see Chapter M6) and are termed 
“spectral envelope peaks”. 
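

A minimal sketch of how spectral envelope peaks might be picked from a 
sampled envelope is given below. A point counts as a peak if it is a local 
maximum and rises by at least a chosen amount above the higher of its two 
adjacent valleys. The threshold of 3 dB, the function name and the envelope 
values are hypothetical; the envelope is invented for illustration and is not 
a measurement. 

```python
def spectral_envelope_peaks(frequencies, levels_db, min_drop_db=3.0):
    """Return (frequency, level) pairs for local maxima of a sampled envelope.

    A local maximum is kept only if it exceeds the higher of the two
    adjacent valleys by at least min_drop_db (an assumed criterion for
    "significantly increased energy").
    """
    peaks = []
    last = len(levels_db) - 1
    for i in range(1, last):
        if not (levels_db[i - 1] < levels_db[i] >= levels_db[i + 1]):
            continue
        # Walk downhill on each side until the envelope starts rising again.
        left = i
        while left > 0 and levels_db[left - 1] <= levels_db[left]:
            left -= 1
        right = i
        while right < last and levels_db[right + 1] <= levels_db[right]:
            right += 1
        drop = levels_db[i] - max(levels_db[left], levels_db[right])
        if drop >= min_drop_db:
            peaks.append((frequencies[i], levels_db[i]))
    return peaks

# Hypothetical envelope sampled every 250 Hz from 0 to 3000 Hz (levels in dB).
frequencies = list(range(0, 3001, 250))
levels_db = [20, 32, 38, 30, 24, 26, 25, 22, 28, 35, 29, 21, 15]

peaks = spectral_envelope_peaks(frequencies, levels_db)
print(peaks)
```

With these invented values, the maxima at 500 Hz and 2250 Hz are kept, 
whereas the small bump at 1250 Hz, which rises only 2 dB above its adjacent 
valley, is discarded. The choice of threshold thus decides what counts as a 
peak. 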


Formant, formant pattern, formant statistics. The term “formant” is 
used in different ways in the literature. In particular, it can refer either 
to a resonance as a physical property of the vocal tract, to a spectral 
envelope peak as a physical characteristic of a vowel sound, or to a 
filter as a part of a series of filters related to an analytical method of 
speech processing. The term can also denote two or even all three of 
these aspects at the same time. 


Here, a basic distinction is made between the resonances of the vocal 
tract and the formants of the vowel sound produced. Such a distinc- 
tion corresponds to the perspective adopted, namely, not to discuss 
the production of a vowel sound but, instead, the vowel sound itself, 
including the related perception of the corresponding vowel quality. 


At the beginning of the present contribution, the term “formant” re- 
fers to spectral envelope peaks as well as to filters used in speech 
analyses, because in the literature, when formulating vowel-specific 
physical characteristics is at issue, both characteristics are generally 
assumed to correspond. In the course of argument, when consider- 
ing current empirical studies and corresponding formant values, it will 
become clear that, today, the concept of vowel-specific formants is 
generally limited to the filters used in speech analyses. 


In the literature, formant abbreviations are often used to distinguish 
between formant frequencies, bandwidths and amplitudes or levels. 
Such a distinction is dispensed with here. Instead, single formants are 
referred to as F1, F2, F3, . . . F(i) and configurations as F1-F2 or F1- 
F2-F3, termed as “formant patterns”. Depending on the context, as is 
the case for F0, these abbreviations refer to formants in general terms 
or to specific levels (or ranges) of formant frequencies in Hz. Formant 
bandwidths and amplitudes play no substantial role in the discussions. 


Accordingly, formants and formant frequencies of vowel synthesis are 
abbreviated as F1’, F2’, F3’, ... F(i)’ and vocal tract resonances are 
abbreviated as R1, R2, R3 ... R(i). 


Note that abbreviations of fundamental, formant and resonance frequencies 
with subscript numbers (F₀, F₁, F₂, F₃, ...) are used only in tables 
showing formant statistics and in citations. 


If references are made to formant values as given in formant statis- 
tics for voiced vowel sounds, corresponding investigations generally 
concern formant measurements for sounds produced in citation-form 
words with medium or spontaneous vocal effort at related fundamental 
frequencies, in a quiet room in front of a microphone. These values are 
often assumed to be representative of so-called “normal speech”, and 
the limitation of measurement in terms of not considering vowel sounds 
produced by single speakers at very different fundamental frequencies 
is often ignored and remains unmentioned. (Please note that, for reasons 
explained in the text and on the basis of observations documented 
in the Materials section, we do not consider the expression “normal 
speech” appropriate and, with regard to both fundamental frequen- 
cy and formant patterns, we question the representative character of 
sounds produced in citation-form words for the utterances in everyday 
life. However, the analysis of sounds produced in citation-form words 
may be comparable to the analyses of relaxed speech.) 


For the ongoing debate on terminology and abbreviations, please refer 
to Section M6. 


LPC. The abbreviation “LPC” stands for Linear Predictive Coding, which 
is a method used to analyse the acoustic characteristics of speech 
sounds. 
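

In general terms, LPC models each sample of a sound as a linear combination 
of the preceding samples; the poles of the resulting all-pole filter are 
commonly read as formant estimates. The following sketch uses the 
autocorrelation method with the Levinson-Durbin recursion, which is one 
common variant and not necessarily the one used in the studies referred to 
in this treatise. The test signal is the impulse response of a single 
artificial resonance at an assumed 500 Hz; all names and values are 
illustrative assumptions. 

```python
import math

def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0] ... r[max_lag] of the signal."""
    return [sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns [1, a1, ..., ap] such that the prediction error is
    e[n] = x[n] + a1*x[n-1] + ... + ap*x[n-p].
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= 1.0 - k * k
    return a

def resonator_impulse_response(freq, radius, sample_rate, n_samples):
    """Impulse response of a two-pole resonator (artificial test signal)."""
    theta = 2 * math.pi * freq / sample_rate
    a1, a2 = -2 * radius * math.cos(theta), radius * radius
    out = []
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out.append(x - a1 * y1 - a2 * y2)
    return out

def pole_frequency(a1, a2, sample_rate):
    """Centre frequency (Hz) of the complex pole pair of 1 / (1 + a1 z^-1 + a2 z^-2)."""
    radius = math.sqrt(a2)
    theta = math.acos(-a1 / (2 * radius))
    return theta * sample_rate / (2 * math.pi)

SAMPLE_RATE = 8000  # Hz, assumed
signal = resonator_impulse_response(500.0, 0.95, SAMPLE_RATE, 400)
coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
estimated = pole_frequency(coeffs[1], coeffs[2], SAMPLE_RATE)
print(round(estimated, 1))
```

For this artificial signal, the estimated pole frequency lies very close to 
the 500 Hz used to generate it. For natural vowel sounds, the result depends 
on analysis parameters such as the model order, which is one reason why 
formant estimation remains a matter of debate. 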


Indications of frequency ranges and frequency limits. Frequency 
ranges and frequency limits for observed aspects of vowel spectra and 
formant patterns and for methodological considerations are given as 
rough approximations. (Note that the vowel-specific frequency range 
for sounds of back vowels and of /a–ɑ/ is given as < 1.5 kHz. However, 
for some sounds of /a/, the upper limit of this frequency range may 
exceed 1.5 kHz; see Section 2.1, for example.) 


Speaker group. The term “speaker group” is used as a short form for 
age- and gender-specific groups of speakers, that is, children, women 
and men, as they are referred to in the literature. (Note that some schol- 
ars term these groups age- and size-specific speaker groups; others 
differentiate further in terms of age, gender and size.) As explained in 
the text, the differentiation of these three speaker groups is motivated 
by three different average vocal-tract sizes. 


In the literature, age- and gender-specific speaker groups are generally 
given in the order “men, women, children”. However, a systematic ad- 
herence to this order carries with it an age and gender bias and poses 
a corresponding problem. Moreover, it mirrors a tradition in phonetics 
to favour the analysis of men’s voices (see also Chapter M6). If, in this 
text, other studies are referred to, the order of listing follows the cited 
study. Apart from those cases, the order is inverted. This makes for 
a formal inconsistency of the text. For future investigations in the field 
of phonetics, the standard for the listing order should be discussed 
and an adequate linguistic form should be established. 


Part I Prevailing Theory and Empirical References 


The first part of the main text reviews the prevailing theory 
of the physical characteristics involved in vowel representation. 


1 Prevailing Theory 


1.1 General Acoustic Characteristics of Vowel Sounds 


With respect to human utterances, the following is said to apply: The 
vocal cords—when oscillating and modulating air expelled from the 
lungs— produce a sound (a source sound), which is transformed by the 
resonances of the pharyngeal, oral and nasal cavities: depending on 
the position of the larynx, velum, tongue, lips and jaw, different shapes 
of these cavities are formed thus creating different resonance char- 
acteristics, allowing different vocal sounds (phones) to be produced 
and perceived accordingly. If a vocal sound is perceived to belong to a 
particular linguistic unit (more precisely, a basic linguistic unit, a pho- 
neme), and if the cavity formed by the pharynx and the mouth remains 
open, then the sound produced is referred to as a vowel sound and 
its linguistic identity as a vowel quality or simply as a vowel (see the 
introduction). 


According to this approach, the production of a vowel sound involves 
two quasi-independent processes: the production of sound and its 
transformation by resonance, termed phonation and articulation. Sound 
production or phonation is not vowel specific. By contrast, the respec- 
tive resonance effect or articulation is vowel specific. The two-part mod- 
el arising from such an understanding of speech production is known 
as the source-filter model of speech production. 
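

The two-part model described above can be sketched in the frequency domain 
as follows. The sketch illustrates the prevailing theory under review, not 
the position argued for in this treatise. Phonation is represented by a 
harmonic source whose amplitudes fall by an assumed 12 dB per octave, and 
articulation by a cascade of digital two-pole resonators. The resonance 
frequencies and bandwidths, the sampling rate and all function names are 
hypothetical values chosen for illustration. 

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, assumed

def resonator_gain(freq, centre, bandwidth, sample_rate=SAMPLE_RATE):
    """Magnitude response at freq (Hz) of a digital two-pole resonator
    with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth * t)
    b = 2 * math.exp(-math.pi * bandwidth * t) * math.cos(2 * math.pi * centre * t)
    a = 1.0 - b - c
    z_inv = cmath.exp(-2j * math.pi * freq * t)
    return abs(a / (1.0 - b * z_inv - c * z_inv * z_inv))

def source_harmonics(f0, max_frequency):
    """Phonation: harmonics of f0 with amplitude 1/k^2 (about -12 dB per octave)."""
    count = int(max_frequency // f0)
    return [(k * f0, 1.0 / (k * k)) for k in range(1, count + 1)]

def vowel_spectrum(f0, resonances, max_frequency=4000.0):
    """Articulation: weight each source harmonic by the cascaded resonances."""
    spectrum = []
    for freq, amplitude in source_harmonics(f0, max_frequency):
        gain = 1.0
        for centre, bandwidth in resonances:
            gain *= resonator_gain(freq, centre, bandwidth)
        spectrum.append((freq, amplitude * gain))
    return spectrum

# Hypothetical resonance pattern: (centre frequency, bandwidth) in Hz.
RESONANCES = [(500.0, 80.0), (1500.0, 100.0), (2500.0, 120.0)]

# Same filter, two different sources: only the harmonic spacing changes.
for f0 in (100.0, 250.0):
    spectrum = vowel_spectrum(f0, RESONANCES)
    strongest = max(spectrum, key=lambda pair: pair[1])
    print(f0, len(spectrum), strongest[0])
```

In this sketch, the filter is defined independently of the source. A change 
of fundamental frequency alters only the frequencies at which the assumed 
resonance curve is sampled by the harmonics, which is the sense in which the 
model treats phonation and articulation as quasi-independent. 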


Physiologically, the perceived linguistic identity of a vowel sound cor- 
responds to a vowel-specific articulation in terms of an ensemble of 
possible positions of the vocal tract, which produce quasi-identical 
(that is, very similar) patterns of resonances. 


Acoustically, the perceived linguistic identity of a vowel sound
corresponds to vowel-specific spectral energy maxima, which are
quasi-identical for all vowel sounds of the same vowel quality. In acoustic
analysis, these spectral energy maxima appear as spectral envelope
peaks, generally known as formants.


In whispered vowels, phonation produces noise rather than a periodic
sound.


1.2 Language-Specific Acoustic Characteristics of Vowel Sounds


In general, not all formants of a vowel but only the first two (those
lowest in frequency) correspond to a perceived vowel quality. The higher
formants relate to other qualities of vocal expression.


In certain languages, exceptions to this rule concern sounds of high
front vowels and of r-coloured front vowels. In such cases, the
frequencies of the first two formants of the sounds of two vowels are
quasi-identical, and only the difference in the frequency of
the third formant corresponds to the difference in the perceived vowel
quality.


1.3 Speaker Group-Specific Acoustic Characteristics 
of Vowel Sounds 


In general, children have a considerably smaller vocal tract than adults,
just as women have a smaller tract than men. Because of this, the
acoustic correspondence between vowel qualities and formant patterns,
formulated above in general terms, has to be related to the different
speaker groups of children, women and men, that is, to age and gender:
for each group and its average vocal-tract length,
the sounds of a given vowel correspond physiologically to a specific
articulation involving a specific resonance pattern, and acoustically to
a specific formant pattern.
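A common textbook idealisation, which is not part of the text above, makes the dependence of resonances on vocal-tract length concrete: a uniform tube closed at one end (the glottis) and open at the other (the lips) resonates at odd multiples of c/4L. The following sketch assumes this idealisation; the speed of sound and the two tract lengths are assumed round values, not measurements.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value


def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


# Hypothetical tract lengths: a longer and a shorter vocal tract.
longer_tract = tube_resonances(17.5)   # [500.0, 1500.0, 2500.0]
shorter_tract = tube_resonances(14.0)  # [625.0, 1875.0, 3125.0]
```

Under these assumptions, shortening the tube raises all resonances by the same factor, which is the simplest form of the speaker group-specific differences that prevailing theory expects.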


1.4 Phonation Type-Specific Acoustic Characteristics of Vowel 
Sounds and Limitation to Voiced Oral Sounds 


The geometry, and thus the resonances, of the glottal area of the vo- 
cal tract vary for different types of phonation. Therefore, for example, 
the formant patterns of voiced and whispered vowel sounds of one 
perceived vowel quality differ substantially. Consequently, the acoustic 
correspondence between vowels and formant patterns must also be 
related to the various types of phonation: thus, for each single speaker 
group too, depending on the respective average vocal-tract length and 
type of phonation, the sounds of a given vowel correspond physiologi- 
cally to a specific articulation involving a specific resonance pattern, 
and acoustically to a specific formant pattern. 


Existing empirical reference values for formant patterns (formant
statistics) predominantly concern voiced vowel sounds produced in
citation-form words, comparable to relaxed speech with limited
variation of fundamental frequency. Statistical reference values for vowel
sounds involving other phonation types are rare. Further, the various
kinds of phonation are related to different methodological problems
of formant pattern estimation. The following discussion therefore
concentrates on voiced vowel sounds. Only passing reference is made to
vowel sounds involving other types of phonation.


Nasal vowel sounds are also related to specific methodological
problems of formant pattern estimation and are therefore not considered
here either. Hence, the following discussion is restricted to voiced oral
vowel sounds.


1.5 Limitation to Isolated Vowel Sounds 


The perception of vowel sounds can depend on the semantic context: 
in some cases, a vowel sound embedded in a syllable or a word may 
be perceived as a certain vowel quality, which, if extracted from the 
context and presented as an isolated sound fragment, may be per- 
ceived to have a different quality. 


Whether or not the perception of vowel sounds can also depend di- 
rectly on their syntactic context, for example when produced in non- 
sense syllables or non-words, is left open here. 


Consequently, the discussion of the acoustic correspondence between 
vowels and formant patterns is further restricted to vowel sounds pro- 
duced in isolation or extracted from a concrete syntactic or semantic 
context. 


1.6 Limitation to Vowel Sounds as Monophthongs 
with Quasi-Constant Sound Characteristics 


In general, single voiced oral vowel sounds that feature a perceivably 
constant vowel quality, a quasi-constant fundamental frequency and 
a quasi-constant loudness throughout their entire duration, exhibit the 
characteristics of a quasi-periodic sound wave. With regard to the phys- 
ical representation of the vowel quality, the corresponding spectral char- 
acteristics of such vowel sounds can be described in terms of the av- 
erage harmonic spectrum of a sound, including the respective spectral 
envelope and, if occurring, its peaks, and with the latter the corre- 
sponding formant patterns. 


This does not apply to vowel sounds whose perceived vowel quality, 
fundamental frequency, or loudness are subject to substantial varia- 
tion. So as to exclude the ensuing questions about a possible influence 
of such variations on the perception of vowel qualities and their spec- 
tral representation, the following discussion focuses on vowel sounds 
as monophthongs that possess quasi-constant sound characteristics. 
Vowel sounds lacking such sound characteristics are again discussed 
only in passing and by way of incidental comments.


1.7 Speech Community-Specific Acoustic Characteristics
of Vowel Sounds


In the first instance, the acoustic correspondence between vowels and 
formant patterns only applies to speakers and listeners belonging to 
the same speech community: quasi-constant vowel production and 
perception exist among the members of such a community, who ac- 
cordingly attribute sound variations either to one and the same vowel 
quality or to different vowel qualities. 


However, the methodological question of how to determine empiri- 
cally the consistency of such an attribution is not discussed further 
here. The present discussion generally assumes that the vowel sounds 
considered, when subjected to a concrete identification test involving 
listeners of one speech community, specially trained for such a per- 
ception test, will exhibit a consistent attribution substantially above a 
50% level for any given vowel quality. 


Correspondences that reach beyond one particular speech community,
or beyond one particular linguistic community, remain to be discussed
elsewhere.


1.8 The Prevailing Theory of Physical Vowel Representation


Given that


- vowel sounds are produced by individuals belonging to one of 
the three speaker groups of children, women, or men of a given 
speech community; 

- vowel sounds are either produced as isolated voiced oral sounds
or as voiced oral sound fragments extracted from their concrete
syntactic and semantic context of production, with transitions
neither at the beginning nor at the end;

- vowel sounds are produced with a quasi-constant fundamental 
frequency and loudness and exhibit the characteristics of a qua- 
si-periodic sound wave; 

- vowel sounds are perceived as belonging to one vowel quality
by other individuals of the same speech community;


then the following applies to the individual vowel sound: 


- physiologically, its perceived linguistic identity as a specific
vowel quality corresponds to a specific position of the vocal tract
which, by means of the first two resonances of the tract in order of
frequency (in some cases of high front vowels and r-coloured front
vowels of certain languages, the first three), transforms the
source sound of the vocal cords into that sound;

- acoustically, its perceived vowel quality hence corresponds to
the first two (or the first three) lower formants of the sound
spectrum.


Given the same assumptions, for two vowel sounds perceived as two 
different vowel qualities, this implies that: 


- physiologically, the difference in vowel perception corresponds
to two different positions of the vocal tract, each with a different
pattern of the lower two (or three) resonances;

- acoustically, the difference in vowel perception corresponds to
two different patterns of the first two (or first three) lower formants
of their respective spectra.


For the sounds of a particular vowel, albeit produced by speakers of 
different speaker groups, this implies that: 


- physiologically, their perceived linguistic identity as the same
vowel quality corresponds to different patterns of the first two (or
first three) lower resonances of the vocal tract, related to the
difference in average vocal tract length of the speaker groups
compared;

- acoustically, their perceived linguistic identity as the same vowel
quality hence corresponds to different speaker group-specific
patterns of the first two (or first three) lower formants of the
respective spectra.


These formulations are central to the prevailing theory of the physical 
representation of the vowel. 


1.9 Formalising Prevailing Theory


For isolated, voiced oral vowel sounds that possess quasi-constant 
sound characteristics and are produced by individuals belonging to a 
given speech community and a given speaker group of children, wom- 
en, or men, the following applies: 


- vowel sounds perceived as one vowel quality correspond to
quasi-identical (that is, similar) R1-R2 (R1-R2-R3 in some cases of
high front vowels and r-coloured front vowels in certain languages)
and, at the same time, quasi-identical F1-F2 (or F1-F2-F3,
respectively);

- vowel sounds perceived as different vowel qualities correspond
to dissimilar R1-R2 (R1-R2-R3, respectively) and, at the same time,
dissimilar F1-F2 (F1-F2-F3, respectively).


1.10 Illustration 


Figure 1 is an illustration of this prevailing understanding of vowel pro- 
duction and perception, typical of many publications in the field. (The 
illustration is simplified in that it lacks any differentiation of the actual 
characteristics of the source spectrum on the one hand, and of the 
radiation impedance occurring when a sound is emitted into space on 
the other. This differentiation is not discussed further here because it is 
irrelevant to the present argument.)


Figure 1. Illustration of prevailing theory.


Figure 2 shows examples of spectra, filter curves (LPC curves) and
formant patterns (maxima of filter curves) of specially selected sounds
of different vowels. This kind of illustration, which is limited to the
acoustic perspective, is also widespread in the literature.


Figure 2. Examples of sounds of different vowels produced in isolation by adult male
speakers at fundamental frequencies of 120-140 Hz. Corresponding spectra and filter
curves (LPC curves) are shown. The examples are specially selected in order to
illustrate prevailing theory.


2 Prevailing Empirical References


2.1 General References 


The first extensive statistical study of the correspondence between
vowels and formant patterns with reference to the three speaker groups
of children, women and men was conducted by Peterson and Barney
(1952; see Table 1 and Figure 1). Their study focused on American
English and later became one of the dominant references in the literature.


Hillenbrand, Getty, Clark, and Wheeler (1995) used new recording and 
measurement methods (digitisation, LPC analysis) as well as an ex- 
tended set of 12 vowels to replicate the classic study of Peterson and 
Barney (see Table 2). 


Parallel to Peterson and Barney, Fant (1959) published a statistical 
study of Swedish vowels. However, Fant's study was limited to the two 
speaker groups of men and women (see Table 3). 


Presumably, the vowel-specific formant patterns as given by Peterson
and Barney (1952) and Hillenbrand et al. (1995) are the most widely
cited references in general discussions of the physical characteristics of
vowels. The statistics of Fant (1959) also played an important role in the
development of the source-filter theory.


Figure 1. Illustration of the distribution of the first two formants for American English
vowels (Peterson & Barney, 1952; data of 76 speakers, 33 men, 28 women, 15 children).
x-axis = formant frequencies (Hz) for F1; y-axis = formant frequencies (Hz) for F2.
(Reproduced with kind permission of Peterson & Barney [1952]. Copyright 1952, Acoustical
Society of America.)


2.2 Empirical Reference for Standard German


Pätzold and Simpson (1997) conducted a statistical study of vowels
of Standard German, produced by men and women (see Table 4,
limited to monophthongs). These values are given here because, as
mentioned in the introduction, most of the author’s experiences and
observations to date concern the sounds of the German language, and
corresponding references are made in the text from Part II onwards.


2.3 Other Statistical References 


References to other formant statistics and additional data of interest to 
the present discussion can be found in the Materials section. Such in- 
formation includes formant statistics for vowels of different languages, 
model-like formant patterns, formant statistics for whispered vowels 
and indications concerning formant patterns of vowel sounds at
different fundamental frequencies.


Part II Reflections


Part II presents reflections that, according to the author’s reading
of the literature, oppose the understanding of the theory, that is,
its intellectual re-enactment and validation.


3 Vowels and Number of Formants 


3.1 Inconstant Number of Vowel-Specific Relative Spectral 
Energy Maxima in Sounds of Back Vowels and of /a-ɑ/


As reported in the literature, when analysing samples of sounds of 
back vowels and of /a-ɑ/, some sounds may exhibit only one distinct
vowel-specific spectral envelope peak, whereas other sounds of the 
same vowels exhibit the expected two pronounced peaks. 


Empirically, the number of vowel-specific relative spectral energy 
maxima proves to be inconstant for sounds of single vowels. 


3.2 Inconstant Correspondence between Vowel-Specific 
Relative Spectral Energy Maxima and Calculated Vowel- 
Specific Formant Patterns 


If sounds of back vowels and of /a-ɑ/ exhibit only a single
vowel-specific spectral envelope peak, then, according to the literature,
formant analysis (e.g. using LPC analysis) often reveals two close
formant frequencies. Such cases are therefore referred to as formant
merging. It follows that, for the sounds in question, the spectral
envelope peak and the calculated first two formants do not correspond
to one another.
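How two calculated formants can coincide with a single envelope peak may be illustrated with a small sketch. It does not perform LPC analysis; it merely assumes a cascade of idealised second-order resonances and counts the local maxima of the resulting envelope. All frequencies and bandwidths are hypothetical values chosen for illustration.

```python
import math


def envelope(frequency, resonances):
    """Gain of a cascade of idealised second-order resonances."""
    gain = 1.0
    for centre, bandwidth in resonances:
        half_bw = bandwidth / 2.0
        gain *= (centre ** 2 + half_bw ** 2) / math.sqrt(
            ((frequency - centre) ** 2 + half_bw ** 2)
            * ((frequency + centre) ** 2 + half_bw ** 2)
        )
    return gain


def envelope_peaks(resonances, start=50, stop=3000, step=5):
    """Frequencies (Hz) at which the sampled envelope has a strict local maximum."""
    grid = list(range(start, stop + 1, step))
    values = [envelope(f, resonances) for f in grid]
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


# Hypothetical values: two well-separated resonances versus two close, broad ones.
separated = envelope_peaks([(500.0, 100.0), (1500.0, 100.0)])
merged = envelope_peaks([(700.0, 250.0), (850.0, 250.0)])
```

With these assumed values, the separated pair yields two envelope peaks, whereas the close pair yields a single peak lying between the two resonance frequencies, although two resonances are still specified in the model.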


Yet, if sounds of back vowels and of /a-ɑ/ exhibit two vowel-specific
spectral envelope peaks, such a correspondence is generally found. 


Thus, the observation of an inconstant number of vowel-specific spec- 
tral envelope peaks of sounds of one and the same vowel calls into 
question the fundamental relationship between spectral envelopes and 
calculated formants. 


No direct parallelism exists between relative spectral energy maxima 
and calculated formants. 


Consequently, formants prove to be constructs of a specific method of 
analysis (see Section 6.1).


3.3 Inconstant Number of Vowel-Specific Relative Spectral
Energy Maxima and of Calculated Vowel-Specific Formants 


As shown in Part I, with regard to high front vowels and r-coloured 
front vowels of some languages, sounds belonging to these vowels 
can exhibit, in part, similar first and second lower spectral envelope 
peaks and formant analysis can reveal similar F1-F2. Thus, the sounds 
of the corresponding vowels are physically distinct only with regard to 
the third spectral envelope peak and the third formant, respectively. 


For such languages, it follows that back vowels, as well as some of the 
front vowels, are physically describable in terms of different patterns of 
F1-F2, whereas the remaining front vowels have to be described only 
in terms of different patterns of F1-F2-F3. 


Empirically, the number of vowel-specific relative spectral energy 
maxima and of calculated vowel-specific formants proves to be in- 
constant among different vowels. 


With regard to spectral envelope peaks, then, the quality of some sounds 
of back vowels is represented by a single peak, the quality of oth- 
er sounds of back vowels and sounds of some front vowels by two 
peaks and the quality of some front vowels by three peaks. 


3.4 Addition: “Spurious” Formants 


In the spectra of the sounds of certain speakers, an additional spectral 
envelope peak may occur between the expected first and second or 
second and third formant. According to the prevailing methodologi- 
cal rules for determining formants, this maximum is not interpreted as 
vowel specific but as a specific characteristic of the speaker’s voice in 
question. Therefore, it is referred to as a “spurious” formant. 


Such “spurious” spectral envelope peaks also need to be considered 
within the context of the inconstant number of vowel-specific spectral 
envelope peaks. 


3.5 Addition: “Flat” Vowel Spectra 


In the literature, some indications for possible vowel perception related 
to “flat” spectral parts, lacking any clearly distinctive relative energy 
maxima, are also given.


3.6 Addition: Inconstant Number of Vowel-Specific Formants 
in Synthesis 


Synthetically produced—and easily recognisable—vowel sounds can 
be generated for most vowel qualities using three- and two-formant 
synthesis. For certain vowels, in particular for back vowels and /a-ɑ/,
this is also possible by way of a one-formant synthesis. 


With regard to synthesised sounds perceived as belonging to one vowel
quality, a comparison of the sounds with F1’-F2’ (two-formant
synthesis) and the sounds with F1’-F2’-F3’ (three-formant synthesis) reveals
differences for F2’, in particular for sounds of front vowels. Similarly,
a comparison of the sounds with F1’ (one-formant synthesis) and the
sounds with F1’-F2’ (two-formant synthesis) reveals differences for
F1’. (However, in the corresponding comparative studies, the
fundamental frequency used in synthesis was not varied systematically.)


Synthesis thus confirms the inconstant number of observable vowel- 
specific formants. Further, synthesis involving different numbers of 
formants (different numbers of filters) indicates differences for F1’ or 
F2’, respectively, although the sounds in question are perceived as
belonging to the same vowel.


4 Vowels and Fundamental Frequency


4.1 Fundamental Frequency, First Formant and “Grade” 
of Vowels 


According to prevailing theory, vowel-specific formant patterns are in- 
dependent of the fundamental frequency of their respective individual 
sounds. 


In general, the frequencies of the first formant of all vowels, as speci- 
fied in current formant statistics for sounds produced in citation-form 
words, comparable to relaxed speech, lie within the range of the possi- 
ble fundamental frequencies for the speakers of a given speaker group. 
Concerning long German vowels, the lowest statistical values for F1
are given for /i, y, u/, medium values for /e, ø, o/, followed by values
for /ɛ, ɔ/, and the highest values are indicated for /a-ɑ/.


If the fundamental frequency involved in producing vowel sounds
exceeds the frequencies of the first formant of /i, y, u/ and approaches
the frequencies of the first formant of /e, ø, o/, then it is to be
expected that the vowels /i, y, u/ become unintelligible because their first
vowel-specific formant is no longer physically representable. Thus, the
vowels /i, y, u/ would be of a “lower grade”, that is, more restricted
in their production, physical representation and intelligibility than the
other vowels. The same would apply to /e, ø, o/ compared to /ɛ, a, ɑ,
ɔ/ and to /ɛ, ɔ/ compared to /a-ɑ/.


In line with prevailing theory, the possibility that the fundamental fre- 
quency of a vowel sound can exceed the first formant frequency of 
a vowel quality as given in formant statistics leads to the assumption 
that the “grade” of vowels differs because of vowel-specific acous- 
tic characteristics. 


However, everyday experience refutes such a generalising conclusion.
If speakers of a given speaker group produce vowel sounds, and if
the fundamental frequency of these sounds exceeds the frequencies
of the statistically given first formant of /i, y, u/ and approaches the
frequencies of the first formant of /e, ø, o/, then all of the six vowels
mentioned can be produced with the same “grade” of vowel
perception, given speakers with correspondingly good vocal abilities. There is
no general impairment of vowel perception for the sounds of /i, y, u/ if
the fundamental frequency exceeds statistical F1.


The same holds true (although it is less obvious in everyday
utterances and applies only to good voices) for the vowels /e, ø, o/ produced at
fundamental frequencies higher than the statistical values of their first
formant frequencies.


Speakers with excellent vocal abilities can even produce clearly intelli- 
gible cardinal vowels up to a fundamental frequency that corresponds 
to the highest statistical F1 of all vowels of the language they master. 


In this context, special attention needs to be given to everyday speak- 
ing styles or habits that exhibit a fundamental frequency variation of 
one octave or more. Such styles and habits plainly reveal the sig- 
nificance of the problem of fundamental frequencies above statistical 
first-formant frequencies, confronting the prevailing acoustic theory of 
the vowel. 


Special attention also needs to be given to utterances of stage voi- 
ces (in musical and straight theatre, entertainment, film, television etc.) 
because extensive fundamental frequency variation is one of the hall- 
marks of the singing and speaking voice in the context of art and en- 
tertainment. 


Generally, with regard to a fundamental frequency range up to the 
maximum frequency of the first formant as given in formant statistics, 
no principally different “grades” of vowel perception in relation to fun- 
damental and first formant frequency can be experienced. 


4.2 Fundamental Frequency, Spectral Envelope, 
Formant Pattern and “Grade” of Vowels 


If the fundamental frequency of a sound increases, so too does the 
frequency spacing between the harmonics in the spectrum. As a con- 
sequence, determining the spectral envelopes and their maxima be- 
comes difficult. The same applies to the calculation of formant fre- 
quencies. According to prevailing theory, it is to be expected that the 
“grade” of vowel perception is in general also dependent on the funda- 
mental frequency of the sounds: with regard to fundamental frequency, 
the expected tendency for vowel perception is: the lower, the better; 
the higher, the worse. 


Indeed, considering vowel sounds at higher pitches, many scholars 
interpret these sounds as related to a spectral undersampling of the 
formants. 


However, one does not only have to consider a general interrelation 
between fundamental frequency, harmonic spectrum, spectral
envelope and expected formant frequencies, but also a formant-specific
role within this interrelation: depending upon given statistical frequen- 
cy values of vowel-specific formants, comparisons show that sounds 
at higher fundamental frequencies may in some cases exhibit frequen- 
cies and relative amplitude maxima of harmonics that correspond to 
the statistical formant frequencies for the vowels in question, whereas 
the frequencies of the harmonics of sounds at lower fundamental fre- 
quencies lie in between these formant frequencies. For the latter, the 
formants are subsequently expected to appear as envelope peaks ei- 
ther only indistinctly or not at all, and the corresponding vowel percep- 
tion is expected to be impaired when compared to sounds at higher 
fundamental frequencies for which the frequencies of the harmonics 
match statistical vowel-specific formant frequencies. 
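The two effects discussed above, the wider harmonic spacing at higher fundamental frequencies and the possible coincidence of harmonics with statistical formant frequencies, can be sketched with simple arithmetic. The function names and the formant value of 600 Hz used in the comments are hypothetical and serve only as an illustration.

```python
def harmonics(f0, upper_limit):
    """Harmonic frequencies k * f0 up to and including upper_limit (Hz)."""
    return [k * f0 for k in range(1, int(upper_limit // f0) + 1)]


def distance_to_nearest_harmonic(f0, formant):
    """Absolute distance (Hz) between a formant frequency and the closest harmonic."""
    k = max(1, round(formant / f0))
    return abs(k * f0 - formant)


# Fewer harmonics sample the range below 2 kHz as F0 rises.
count_low = len(harmonics(100, 2000))
count_high = len(harmonics(400, 2000))

# A higher F0 (200 Hz) can match a formant at 600 Hz exactly,
# while a lower F0 (160 Hz) places its harmonics on either side of it.
gap_at_200 = distance_to_nearest_harmonic(200, 600)
gap_at_160 = distance_to_nearest_harmonic(160, 600)
```

In this example, the spectrum at 100 Hz contains four times as many harmonics below 2 kHz as the spectrum at 400 Hz, yet the sound at 200 Hz has a harmonic exactly at the assumed formant frequency while the sound at 160 Hz does not. This is the discontinuity that the reasoning in the text derives from prevailing theory.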


Such reasoning leads to the assumption that there is not only a general 
but also a discontinuous relationship between the intelligibility of vowel 
sounds and their fundamental frequency: accordingly, vowel sounds 
at lower fundamental frequencies would, as a rule, be more intelligible 
than vowel sounds at higher frequencies, but vowel intelligibility would 
also depend upon the respective relationships between fundamental 
frequency, harmonic spectrum and vowel-specific formant patterns (as 
given in formant statistics). 


In line with prevailing theory, the relationship between fundamental 
frequency, harmonic spectrum, spectral envelope and expected vowel- 
specific formant pattern leads to the same assumption that the 
“grade” of vowels differs in relation to vowel-specific acoustic char- 
acteristics. 


However, as explained, everyday experience refutes such a general- 
ised conclusion. Thus, a theory of vowels as elements of language that 
formulates an inherently qualitative and at the same time discontin- 
uous relationship between fundamental frequency and vowel percep- 
tion stands in contrast with the—possibly “sensational” —characteris- 
tic of a voiced element of language being independent of pitch within 
the range of intelligible speech.


5 Formant Patterns and Speaker Groups


5.1 Fundamental Frequency, Spectral Envelope, 
Formant Pattern and “Grade” of Vowels Uttered by Children, 
Women and Men 


If one further extends the reasoning developed in the previous chap- 
ter, namely that—according to prevailing theory—the intelligibility of a 
vowel sound is expected to relate to the respective fundamental fre- 
quency of the sound and the (statistically given) first formant frequency 
of the vowel, then, correspondingly, the “grade” of vowel perception 
should also depend upon the speaker group: vowel intelligibility should 
prove to be best for men, average for women and worst for children. 


According to prevailing theory, the above relationship between fun- 
damental and first formant frequencies, spectral characteristics and 
expected differences in the “grade” of intelligibility of different vowel 
qualities leads to the assumption that the “grade” of vowels varies 
for different speaker groups (children, women, or men). 


Everyday experience also refutes this generalisation. Thus, again, a
theory of vowels as elements of language that formulates an inherently
qualitative relationship between age and gender on the one hand, and
vowel perception on the other, stands in contrast with the (yet again
possibly “sensational”) characteristic of a voiced element of language
being quasi-independent of a speaker’s constitution (if not impaired).


Vowels as such are related neither to age nor to gender. If direct com- 
parisons of utterances of single speakers show that some speakers 
produce vowel sounds “better” (better in vowel intelligibility) than others, 
then, this has to do with the vocal abilities of the individual speakers in- 
vestigated, not with vowels, speaker groups, or vocal-tract sizes (with 
the exception of very young children acquiring their first language). As 
a rule, vowels, as speech sounds of a given language, can potentially 
be produced with equal intelligibility by speakers of all general speaker 
groups. Vowels are not attributes of an individual, but elements of
language. Vowels are “abstracted” from the individual.


5.2 One Vowel, Different Formant Patterns


In the literature, empirical reference values for vowel-specific formant 
patterns are given separately for each speaker group (children, wom- 
en, or men), that is, in group-specific terms (see, for example, Chapter 
2). In the first instance, these differences in formant patterns are not 
explained in terms of varying average fundamental frequencies, but in 
terms of varying average vocal-tract size. 


This view leads to the assumption that each vowel is physically
represented by three different speaker group-specific formant patterns,
not only in terms of different fundamental frequencies, but also in
terms of the same fundamental frequency: in general, women and men
are able to produce clearly recognisable vowel sounds at a child’s
fundamental frequency, for instance at around 250 Hz (see Section 2.1;
note, in this context, that in the statistics of Hillenbrand et al., F0
differences between women and children do not exceed 20 Hz). Given such
cases of sounds at similar fundamental frequencies, three sounds of
the same vowel, produced by a man, a woman and a child respectively,
are expected to exhibit three substantially different formant
patterns, despite the similarity in vowel perception.


According to prevailing theory, the relationship between vowel-spe- 
cific formant patterns and age- and gender-related speaker groups 
leads to the assumption that the physical representation of a vowel 
is based upon different formant patterns. 


Such reasoning also leads to the assumption that women and men 
are capable of producing sounds of a given vowel with fundamental 
frequencies substantially higher than those of children, albeit with sub- 
stantially lower corresponding formant patterns. 


The problem that the particular sound configurations in question pose
to the theoretical approach discussed here becomes particularly
evident when considering corresponding sounds of the vowels /a, ɑ,
ɔ, o, u/ that are low-pass filtered with a cut-off frequency of 2 kHz
(note that, for these vowels, the statistical values of the vowel-specific
formant patterns F1-F2 for all three speaker groups discussed here are
given as < 2 kHz): then, neither different fundamental frequencies nor
different higher spectral energy configurations can play a role in vowel
perception or explain why three different patterns of F1-F2 can be
expected to represent the same vowel.


It goes without saying that the above also holds true for the restricted
comparison between women and men.


The problem described here becomes particularly acute if, instead of 
natural vocalisations, corresponding sound configurations are studied 
by means of vowel synthesis, applying similar fundamental frequen- 
cies but different patterns F1’-F2’. 


However, in its turn, such a conclusion runs counter to the requirement of
a psychophysical parallel between perceived vowel quality and phys- 
ical representation: formant patterns are either vowel specific, which 
means that clearly distinct formant patterns do not represent the same 
vowel—regardless of the fundamental frequency—or they are, as such, 
not directly vowel specific. According to the first stance, the assump- 
tion of speaker group-specific formant patterns would have to be ques- 
tioned. According to the second stance, the assumption of vowel-spe- 
cific formant patterns in general would have to be questioned. 


5.3 Different Vowels, One Formant Pattern 


Disregarding the comment in the previous paragraph, the pursuit of the 
reasoning developed in Section 5.2 leads to the further assumption that 
a single formant pattern can represent two different vowels: given that 
the sounds of a vowel produced by a speaker of one speaker group 
exhibit higher vowel-specific formant frequencies than the sounds of 
the same vowel produced by a speaker of another speaker group, and 
that the fundamental frequency plays no substantial role in the physi- 
cal representation of the vowel in terms of formant patterns, and also 
given that the vowel-specific formant frequencies of the sounds of the 
first speaker lie within the frequency range of the possible vowel-spe- 
cific formant frequencies of the second speaker, then it must be pos- 
sible to find cases of comparisons of two sounds, each produced by 
one of these two speakers, that exhibit similar vowel-specific formant 
patterns, yet are perceived as different vowels. 


According to prevailing theory, the relationship between vowel-spe- 
cific formant patterns and age- and gender-related speaker groups 
leads to the assumption that a single formant pattern can physically 
represent two different vowels. 


Again, the problem that such sound configurations pose to the theoretical
approach discussed here becomes particularly evident when considering
corresponding sounds of the vowels /a, ɑ, ɔ, o, u/, because
the vowel-specific formant frequencies of the corresponding sounds of
all speaker groups are given in formant statistics as < 2 kHz, and in such
a frequency range, adults can reproduce sounds exhibiting any of the
F1-F2 patterns found in sounds of children. The same holds true when
comparing the sounds of men and women.


The problem described here becomes particularly acute again if repli- 
cated by means of vowel synthesis, above all including extensive vari- 
ation of the fundamental frequency. 


However, in line with the explanation given above, the assumption of 
a possibility of twofold representation, according to which a single 
formant pattern can correspond physically to the sounds of two differ- 
ent vowels, runs counter to the requirement of a psychophysical para- 
llel between perceived vowel quality and physical representation. At the 
same time, indeed, it directly contradicts prevailing theory. 


This consideration engenders a decided scepticism about the claim 
that vowel-specific formant patterns are both fundamentally and con- 
tinuously dependent upon the speaker group, that is, upon vocal-tract 
size. A fundamental dependence is already difficult to understand from 
an intellectual standpoint because, as mentioned, vowels do not “have” 
an age or gender. Besides, the simple fact that sounds of back vowels 
can be synthesised at fundamental frequencies, observable in sounds 
of children as well as in sounds of men, paradigmatically illustrates the 
problem: if, in synthesis, F1-F2 is changed substantially but the fun- 
damental frequency is held constant, in general, the perceived vowel 
quality also changes, irrespective of whether the F1-F2 of the synthe- 
sis corresponds to a pattern observed for natural sounds of a child or 
of a man. 


At the same time, the above reflection suggests an alternative expla- 
nation for the existing empirical findings, which seemingly provide 
evidence for speaker group-specific formant patterns: the vowel-specific
spectral energy configuration, and with it the calculated formant
pattern, can depend upon fundamental frequency.


It is remarkable that, in general, formant statistics deemed worthy of 
reference in the literature do not give frequency values of formant patterns
of the different speaker groups for systematically varied fundamental
frequencies. Thus, currently, there is no empirical evidence in the
literature to support the claim that observed, speaker group-specific 
formant patterns of vowels should in principle not be attributed to the 
different—and simultaneously observed— fundamental frequencies of 
the respective sounds but, instead, to different average vocal-tract
sizes. With regard to the first formant for all vowels, and probably also
to the second formant for back vowels, the present reflection indicates 
that such evidence cannot be furnished. 


5.4 A Gap in the Reasoning 


As indicated, existing formant statistics suggest that, irrespective of 
fundamental frequency and perceived vowel quality, adults are capa- 
ble of producing sounds for almost all variants of F1-F2 patterns as 
found in children’s vowels. Thus, even though adults have larger vocal 
tracts than children, for most vowels, they are nevertheless capable of 
producing sounds that exhibit the same vowel-specific formant pat- 
terns, above all F1-F2, as evidenced for the sounds of children. 


If it is indeed the case that speakers of all three speaker groups are con- 
sidered to be capable of producing the same vowel-specific patterns 
for a substantial part of vowels, then how are the pattern differences 
discussed above to be understood? (Many scholars assume that the 
schwa sound defines the midpoint of a speaker’s vowel space and 
plays a central role for the formant pattern differences discussed: be- 
cause of different average vocal tract lengths and different resonance 
patterns of related open tubes of speakers of different age and gender, 
it is deduced that different vowel-related formant patterns mirror different
midpoint reference patterns. However, in the present context, such
an assumption does not dispense with the question posed: sounds
of schwa, too, can be produced on different fundamental frequencies, 
and the independence or dependence of related formant patterns on 
fundamental frequency for perceptually unaltered schwa quality has 
not yet been clarified.) 


Even though existing statistical values list vowel-specific formant pat- 
terns for children exceeding those for adults, and for women exceed- 
ing those for men, there are nevertheless exceptions: in some cases, 
as shown by some statistics, single vowel-specific formant frequen- 
cies, or even vowel-specific formant patterns F1-F2 or F1-F2-F3, for 
sounds produced by men do not differ from those for sounds pro- 
duced by women; they may even slightly exceed the latter. (Thus, re- 
markably, the formant patterns given by Fant, 1959, for a single male 
and a single female speaker do not show a consistent speaker group 
related difference; see Section 2.1, Table 3. Besides, there are cases in 
which the statistical F1 of women slightly exceeds the F1 of children, 
see, for instance, Section 2.1, Table 2, and the corresponding values 
for the vowel /ʌ/.) This raises the same question as above.


The relationship between vowel-specific formant patterns and age-
and gender-related speaker groups described in terms of prevailing 
theory fails to explain why, despite different vocal-tract sizes, similar 
vowel-specific formant patterns are basically possible at least for 
the majority of vowels but are—according to theory—not realised 
(actually not produced). 


In addition, this formulation could also prove to be generally applica- 
ble: it could prove to be the case that all vowel-specific formant pat- 
terns, F1-F2 and F1-F2-F3 as given in formant statistics for children, 
can also be produced by women and men. (With regard to this aspect, 
utterances of voice-over artists are of particular interest.) 


To reiterate: from a psychophysical perspective, the correspondence
between intelligible vowel sounds and the vowel-related
physical characteristics must be formulated as such. The formulation 
of speaker-independent and, in a strict and direct sense, vowel-specif- 
ic acoustic features represents the touchstone for any acoustic theory 
of the vowel. 


5.5 Addition: Formant Patterns of Voiced and Whispered 
Vowel Sounds 


Empirical studies comparing voiced and whispered vowel sounds in- 
dicate substantial differences in the formant patterns related to the 
perceived vowel qualities. In particular, the first formant frequency of 
whispered sounds of a given vowel (and, according to some studies, 
the second formant frequency, too) is found at significantly higher
frequency levels than those of voiced sounds. (As mentioned in Sec- 
tion 1.4, such differences are explained as a consequence of differen- 
ces in the geometry, and thus the resonances, of the glottal area of the 
vocal tract for the two different phonation types in question.) 


This finding relativises again the attempt to establish a direct corre- 
spondence between vowels and formant patterns: the sounds of the 
same vowel can exhibit different formant patterns, not only because of 
different average vocal-tract sizes but also because of different kinds 
of phonation acting upon a configuration of a single vocal tract. 


Moreover, comparisons between published formant frequencies of whis- 
pered and voiced vowel sounds indicate that all F1, and the majority of 
F2 < 1.5 kHz, of whispered sounds produced by men generally exceed
the corresponding F1 and F2 of voiced sounds produced by women, 
given the same perceived respective vowel identities and notwithstanding
men’s larger vocal tract. The same applies to a comparison between
whispered sounds of women and voiced sounds of children.
Restricted to F1, this also applies to the comparison between whis- 
pered sounds of men and voiced sounds of children. 


This observation relativises in turn the assumption of a correspond- 
ence between vocal-tract size and vowel-specific formant patterns: 
based on the values given in the literature, such a correspondence is 
documented only for sounds of one and the same phonation type, not 
for a comparison of sounds of different phonation types. Besides, it 
should be noted that the frequency differences of the lower formants 
for the sounds of a given vowel, which relate to different types of pho- 
nation, e.g. voiced versus whispered sounds, are in general greater 
than the corresponding formant frequency differences between the 
different speaker groups. 


Thus, most importantly, vowel-related formant patterns produced by 
one vocal tract can differ more than vowel-related formant patterns 
produced by different vocal tracts with very different tract sizes. 


Moreover, referring to Section 5.3, a single formant pattern seems able 
to physically represent different vowels not only if the correspond- 
ing sounds are produced by speakers belonging to different speaker 
groups, but also if an individual speaker varies his or her phonation. 


Such consideration will be discussed further in Part III: comparisons 
between the formant patterns of voiced and whispered sounds, as 
documented in the literature, refer only to the average (lower) funda- 
mental frequency of voiced vowel sounds produced in citation-form 
words, but not to a comparison including a systematic variation in fun- 
damental frequency of voiced sounds. (Such an experimental arrange- 
ment assumes, once again, that formant patterns are independent of 
fundamental frequency, which is therefore negligible when comparing
voiced and whispered sounds.)


6 Terms of Reference, Methods of Formant
Estimation


6.1 Formant and Sound Spectrum 


Given that the terms “resonance” and “formant” are distinguished from 
each other, as a means of distinguishing the characteristics of the vo- 
cal tract from those of the sound spectrum, then the psychophysical 
question of the vowel relates to formants only. According to prevailing 
theory, it is assumed that, in the first instance, the spectrum of a vowel 
sound exhibits determinable relative energy maxima, which are related 
to vowel-specific frequency ranges, and that, as a rule, the frequen- 
cies of these relative spectral energy maxima correspond to calculated 
formant frequencies, for example, applying LPC analysis. (Note that, 
nowadays, formant frequencies are no longer derived as numerical val- 
ues from the spectral envelope but, instead, are calculated as filters of 
an analytical model, although the corresponding numerical results are 
in many cases crosschecked on the basis of a spectrogram.) 
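

To make concrete what "calculated as filters of an analytical model" can mean, the following is a minimal sketch, not the procedure of any particular study or software package. A single synthetic resonance is generated and then recovered by autocorrelation-method LPC of order 2. All function names and all parameter values (700 Hz, 100 Hz bandwidth, 10 kHz sampling rate, 2000 samples) are illustrative assumptions. Note that the model order, that is, the number of filters, has to be chosen before the analysis is run.

```python
import math

def resonator_impulse_response(freq_hz, bw_hz, fs_hz, n_samples):
    """Impulse response of a two-pole resonator (one synthetic 'formant')."""
    r = math.exp(-math.pi * bw_hz / fs_hz)      # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz     # pole angle from frequency
    b1, b2 = 2.0 * r * math.cos(theta), -r * r
    y = []
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0              # unit impulse as source
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(x + b1 * y1 + b2 * y2)
    return y

def autocorrelation(signal, max_lag):
    return [sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
            for lag in range(max_lag + 1)]

def lpc(signal, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap] of A(z)."""
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                          # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def estimate_resonance(signal, fs_hz):
    """Frequency and bandwidth of the single pole pair of an order-2 model."""
    _, a1, a2 = lpc(signal, 2)                  # the order is fixed in advance
    radius = math.sqrt(a2)
    theta = math.acos(-a1 / (2.0 * radius))
    freq_hz = theta * fs_hz / (2.0 * math.pi)
    bw_hz = -math.log(radius) * fs_hz / math.pi
    return freq_hz, bw_hz

signal = resonator_impulse_response(700.0, 100.0, 10000.0, 2000)
print(estimate_resonance(signal, 10000.0))      # close to (700, 100)
```

In this idealised case the model matches the way the signal was generated, so the estimate agrees with the synthesis parameters. The returned "formant" is a property of the fitted filter, not a peak read off the spectrum.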


As discussed in Sections 3.1 and 3.2, the sound spectra of back vowels
and of /a–ɑ/ can exhibit only one single vowel-specific spectral
energy maximum, although formant analysis using an analytical model
(e.g. LPC analysis), drawing on “phonetic knowledge” and
sometimes with interactive manual adjustment of parameter settings,
indicates two vowel-specific formants, often close in frequency. This
contradicts the assumption that the number and frequency of relative
spectral energy maxima, that is, the envelope peaks, always correspond
to analytically determined formants.


As mentioned in Section 4.2, due to the increasing frequency spacing 
of the harmonics, the higher the fundamental frequency, the more dif- 
ficult it becomes to determine the spectral envelope and its peaks (for 
further details, see also Section 6.4). This in turn impedes the formu- 
lation of a general correspondence between relative spectral energy 
maxima and calculated formant frequencies. 


Regarding the current procedures used in formant analysis and the 
corresponding numerical values of formant patterns, it follows that in 
many cases, and thus in principle, the term formant does not
designate a characteristic of the sound spectrum itself, but instead a
construct or even artefact of the respective method of analysis.


In the current literature, the term formant (if distinguished from
resonance) generally refers neither to any actual characteristic of the
vocal tract nor to any actual characteristic of the sound spectrum. 
The term generally refers to filters of an analytical model. At the 
same time, formants are not determined on the basis of spectra but 
on the basis of such an analytical model. 


Thus, the assumption that a direct correspondence exists between 
resonances as a physical property of the vocal tract, spectral energy 
maxima as a physical characteristic of the vowel sound produced and 
filter frequencies derived from methods used in the acoustic analysis 
of vocal sounds, loses its plausibility. 


6.2 Speaker Group and Vocal-Tract Size 


As discussed, prevailing theory supposes a relationship between vowel- 
specific formant patterns and age- and gender-related speaker groups 
and explains corresponding differences in terms of the respective av- 
erage vocal-tract sizes. 


It can be assumed that some women have larger vocal tracts than 
some men. Comparing the vowel sounds of these female and male 
speakers, the following constellation is of particular interest in the 
present context: the sounds of the female speakers in question exhibit 
fundamental frequencies corresponding to the average fundamental 
frequency values for women in general, as given in formant statistics, 
and the sounds of the male speakers in question exhibit substantially 
lower fundamental frequencies. Then, according to prevailing theo- 
ry, the vowel-specific formants of these female voices would have to 
exhibit lower frequencies (despite comparatively higher fundamental
frequencies) than the corresponding formant patterns of these male
voices. 


Extending such consideration, this comparison raises the question of 
a systematic investigation of the relationship between vocal-tract size 
and vowel-specific formant patterns within a single speaker group. 


Besides the lack of an empirical basis for the questions raised here, 
the above reflections again point to the fact that prevailing theory does 
not claim that vowel-specific formant patterns depend in principle on 
age and gender, but that different vowel-specific formant patterns exist 
for different vocal-tract sizes: prevailing theory only refers to speaker 
group-specific differences in average vocal-tract sizes.


The term “age- and gender-related speaker group” is related to the
term “age- and gender-related average vocal-tract size”. 


6.3 Formant Analysis and Objectivisation 


Concerning natural vocalisations, current analytical methods for de- 
termining formants apply a model-like procedure in order to calculate 
a specific configuration of source sound and filters which, by means 
of transformation of source by filters, “reproduces” a sound that best 
corresponds to the real sound. (The same applies to whispered vowel 
sounds, in relation to the source as noise.) 


Such a procedure must not only assume certain characteristics of the 
source sound but also a certain number and certain characteristics of 
the filters involved in the frequency range under investigation. (Note 
that, according to prevailing theory, different numbers of formants are 
expected for a given frequency range in relation to different speaker 
groups because of their different average vocal-tract size. Thus, the 
number of filters for the analysis of a sound must be set accordingly.) 
How closely the characteristics of the source sound approach actual 
phonation remains open. The same applies to the question of whether 
the number of filters and their characteristics actually correspond to 
real articulation and its resonance. 
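

The dependence of the expected number of formants on vocal-tract size can be illustrated with the textbook idealisation of a uniform tube closed at one end, whose resonances lie at F_n = (2n - 1) c / (4 L). The sketch below is only that idealisation: the tract lengths, the speed of sound (35000 cm/s) and the 5 kHz analysis range are assumed round values chosen for illustration, not figures taken from the statistics discussed in this text.

```python
def tube_resonances(length_cm, max_freq_hz, speed_of_sound_cm_s=35000.0):
    """Resonances (2n - 1) * c / (4 * L) of a uniform tube, below max_freq_hz."""
    resonances = []
    n = 1
    while True:
        freq = (2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
        if freq >= max_freq_hz:
            return resonances
        resonances.append(freq)
        n += 1

# Hypothetical average tract lengths, for illustration only.
ASSUMED_LENGTHS_CM = {"men": 17.5, "women": 14.0, "children": 11.0}

for group, length_cm in ASSUMED_LENGTHS_CM.items():
    found = tube_resonances(length_cm, 5000.0)
    print(group, len(found), [round(f) for f in found])
```

Under these assumptions, the shorter the tube, the fewer resonances fall below 5 kHz, which is why the number of filters is usually preset according to the assumed speaker group.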


Thus, formants cannot be determined reliably on the basis of a vow- 
el sound alone. Analysis requires at least some prior knowledge of 
whether the sound under investigation has been produced by a man, 
woman, or child, assuming that this information is sufficient to deduce 
the number of filters (related to the frequency range of interest) to be 
used in formant analysis. 


Besides, automatically calculated formant frequency values are often
subsequently double-checked visually on the basis of the sound spectrogram:
if the values calculated in the first step (based on analytical
parameters according to existing standards and the known speaker
group) do not correspond to the relative spectral energy maxima of
the analysed sound, then the number of filters is varied and the analysis
is repeated until such a correspondence occurs. As a rule, the
characteristic of the source sound is not altered. However, this only 
applies to cases where such an interactive analysis is able to produce 
vowel-specific numbers and frequencies of formants that correspond 
to the number and frequency ranges to be expected according to pre- 
vailing theory and established statistical patterns, and which are also 
clearly indicated in the spectrogram. If an interactive procedure of analysis
yields no values with such a correspondence, then the respective
vowel sounds are often excluded from further studies, irrespective of 
vowel perception. Exceptions include so-called “formant merging”, as 
discussed in Section 3.2. 


Thus, current methods of formant analysis presuppose that research- 
ers have the necessary analytical skills, that is, a knowledge of the 
existing phonetic principles and rules of interpretation as well as ex- 
tensive first-hand experience of conducting such an analysis. This 
involves prior training because such an analysis involves contextual 
knowledge, the ability to visually compare numerical values with a cor- 
responding sound spectrogram, together with the ability to interpret 
the latter visually, and also the skills to vary filter settings interactive- 
ly and to perform the repetition of numerical analysis. Consequently, 
methods of formant analysis are not completely objectifiable. If they 
were, then researchers would play no part as individuals in such re- 
search. 


Strictly speaking, methods of formant analysis are not fully objectifi- 
able; accordingly, they cannot be fully automated. 


Most importantly, these procedures are also very time consuming. 
Therefore, investigations based on very extensive samples of sounds 
are problematic with regard to method. This is the case particularly if 
the fundamental frequency is varied: then, specific problems of analy- 
sis aggravate the costly character of the method as such. Obviously, 
this holds true for all repetitions and verifications of existing investiga- 
tions. 


6.4 Formant Analysis, Fundamental Frequency and Speaker 
Group or Vocal-Tract Size 


In addition to formant analysis not being fully objective and automat- 
ed, it also depends on the respective fundamental frequencies of the 
sounds. To repeat: the higher the fundamental frequency, the more dif- 
ficult it becomes to determine the spectral envelope peaks expected 
because the frequency spacing between the harmonics becomes too
large to accurately define the spectral envelope. It also becomes in- 
creasingly difficult to determine the formants within any of the existing 
analytical frameworks. 
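

The effect of harmonic spacing can be made tangible with a small count. Since the harmonics of a periodic sound lie at integer multiples of the fundamental frequency, the number of spectral points available to outline the envelope within a given range shrinks as the fundamental frequency rises. The function name and the chosen values (a 1.5 kHz range, fundamental frequencies of 100 to 800 Hz) are illustrative assumptions.

```python
def harmonics_up_to(f0_hz, limit_hz):
    """Frequencies of all harmonics k * f0 that do not exceed limit_hz."""
    count = int(limit_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]

for f0_hz in (100, 200, 400, 800):
    harmonics = harmonics_up_to(f0_hz, 1500)
    print(f0_hz, len(harmonics), harmonics)
```

At 100 Hz, fifteen harmonics sample the envelope up to 1.5 kHz; at 800 Hz, only one does.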


With regard to critical limits of fundamental frequencies, above which 
methods of formant analysis become unreliable, two kinds of reference
values need to be considered: firstly, half the frequency of the lowest
first formant for a speaker group in terms of an average vocal-tract 
size, and secondly, the frequency of the lowest formant for a speaker 
group. 


For a fundamental frequency above half of the first formant frequency 
(F0 > ½F1), the frequency spacing between the harmonics is already so
extended that defining a spectral envelope and evaluating the calcu- 
lated numerical formant frequencies becomes problematic. (Note that 
for such sounds, the formants may not be clearly indicated by at least 
two harmonics.) According to this first kind of limit, and referring to the 
standard values established by Hillenbrand et al. (1995) for F1 of /i/ 
(the lowest average value for F1 in these reference statistics), formant 
analysis becomes critical for fundamental frequencies higher than: 


- 226 Hz for sounds of children (involving short vocal tracts)
- 219 Hz for sounds of women (involving medium-sized vocal tracts)
- 171 Hz for sounds of men (involving long vocal tracts)


For a fundamental frequency above the lowest first (statistically given) 
formant frequency for a given speaker group, under the assumption of 
independence of formants from fundamental frequency, it is basically 
impossible to distinguish all F1 of all vowels produced by speakers of 
that group, not to mention the aggravated problem of determining the 
spectral envelope. According to this second kind of limit, and again 
referring to the above statistics, methods of formant analysis lack a 
methodological basis for fundamental frequencies higher than: 


- 452 Hz for sounds of children (involving short vocal tracts)
- 437 Hz for sounds of women (involving medium-sized vocal tracts)
- 342 Hz for sounds of men (involving long vocal tracts)


Note that, referring to the statistics of Pätzold and Simpson (1997) for
German vowels, shown in Section 2.2, the limits would have to be
set at even lower frequencies: ½F1 of /i/ corresponds to 165 Hz for
women (medium-sized vocal tracts) and to 145 Hz for men (long vocal
tracts), respectively; F1 of /i/ corresponds to 329 Hz for women and to
290 Hz for men, respectively.
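

The two kinds of limits can be recomputed directly from the F1 values of /i/ quoted above. The short sketch below does only this arithmetic; the dictionary layout and the function name are illustrative choices, while the F1 values are those cited in the text.

```python
# F1 of /i/ in Hz, as quoted in the text for the two sets of statistics.
F1_OF_I_HZ = {
    "Hillenbrand et al. (1995)": {"children": 452, "women": 437, "men": 342},
    "Pätzold and Simpson (1997)": {"women": 329, "men": 290},
}

def critical_f0_limits(f1_hz):
    """First limit: F0 above half of F1. Second limit: F0 above F1."""
    return {"half_f1": f1_hz / 2, "f1": float(f1_hz)}

for source, groups in F1_OF_I_HZ.items():
    for group, f1_hz in groups.items():
        limits = critical_f0_limits(f1_hz)
        print(source, group, limits["half_f1"], limits["f1"])
```

The halved values 218.5 Hz and 164.5 Hz appear in the text rounded up to 219 Hz and 165 Hz.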


In this context, attention should also be given to the fact that, accord- 
ing to several formant statistics, the frequency distance between F1 
and F2 for sounds of some back vowels is given as < 500 Hz. Thus, the
frequency spacing of the first two harmonics in a spectrum of a sound
on a fundamental frequency above this frequency limit exceeds the
F1-F2 distance mentioned, which renders formant estimation obsolete
within the existing theoretical framework.
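

A small count illustrates this point. When the fundamental frequency exceeds the F1-F2 distance, at most one harmonic can fall between the two formant frequencies, so two separate peaks cannot both be marked by harmonics in that range. The F1 and F2 values used below (400 Hz and 850 Hz) are hypothetical and serve only as an example of a back-vowel pattern with an F1-F2 distance below 500 Hz.

```python
def harmonics_in_band(f0_hz, low_hz, high_hz, max_harmonic=200):
    """Harmonics k * f0 that lie within the closed band [low_hz, high_hz]."""
    return [k * f0_hz for k in range(1, max_harmonic + 1)
            if low_hz <= k * f0_hz <= high_hz]

F1_HZ, F2_HZ = 400, 850   # hypothetical pattern, F2 - F1 = 450 Hz

for f0_hz in (150, 520, 900):
    print(f0_hz, harmonics_in_band(f0_hz, F1_HZ, F2_HZ))
```

With these assumed values, three harmonics lie between F1 and F2 at 150 Hz, one at 520 Hz, and none at 900 Hz.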


The first list of frequency limits given above (for F0 > ½F1) suggests that,
methodologically speaking, the analysis of vowel sounds of children
and women must be considered problematic in general. The critical
fundamental frequency value mentioned for children is considerably
lower than the empirically determined average fundamental frequency
that children exhibit when producing vowels in citation-form
words, which can be considered as related to relaxed speech on a
comparatively low fundamental frequency (see, for example, the statistics
in Section 2.1). Thus, most vowel sounds produced by children
in their everyday expression exhibit substantially higher fundamental
frequencies. According to Hillenbrand et al. (1995), the mentioned
critical fundamental frequency value for women corresponds to the
average fundamental frequency of women producing vowels in
citation-form words. In everyday speech, however, vowel sounds in a
fundamental frequency range of up to one octave higher than this value
are the norm. Moreover, according to Pätzold and Simpson (1997),
the mentioned critical fundamental frequency value for women is again
considerably lower than the average fundamental frequency generally
given in vowel statistics. The problem discussed here seems to be
less pronounced among men than among women and children, but it
nevertheless concerns a substantial part of their utterances.


The second list of frequency limits reveals that, for methodological rea- 
sons, any determination of formant patterns of vowel sounds exhibit- 
ing fundamental frequencies that exceed low first-formant frequencies 
does not make sense, since general rules for formant estimation can 
no longer be formulated. In this regard, particular consideration needs 
to be given to voices exhibiting extensive prosodic variations in fun- 
damental frequency, which can be experienced in everyday speech 
and, in very pronounced form, in the field of art and entertainment. (Notably,
with regard to everyday speech, the literature does not provide
ample documentation of the occurrence and significance of such ex- 
tensive variation in fundamental frequency, allowing for a validation of 
the significance of the methodological problem of formant estimation 
discussed here. However, in the Materials section, examples of corresponding
utterances are documented; see Section M8.2.)


Within the prevailing theoretical framework, the reliability of formant
analysis depends on fundamental frequency and the age- and gender-related
speaker group, that is, vocal-tract size. Depending on the
latter, formant frequency estimation becomes critical for fundamental
frequencies above c. 175 Hz, and formant frequency estimation
can no longer be methodologically substantiated for fundamental fre- 
quencies substantially above 350 Hz. Consequently, formant analysis 
cannot be applied to all cases of clearly intelligible vowel sounds. 


A part of the literature tends to equate the methodological problem with 
a particular characteristic of vowel perception, which leads us back 
to the two assumptions discussed in Sections 4.1 and 5.1: firstly, that 
vowels produced by children and women are basically less intelligible 
than those produced by men; and secondly, that at least some vowels 
of sounds at a fundamental frequency substantially above 350 Hz can
no longer be clearly distinguished. As suggested, however, both as- 
sumptions contradict actual vowel perception. 


6.5 Addition: Parameter Adjustments in Formant Analysis and 
Inconsistent References to Vocal-Tract Size 


On the one hand, formant parameters in current procedures of formant 
analysis are defined prior to analysis of the sounds depending on the 
corresponding speaker group, that is, the assumed average vocal-tract 
size of the speakers. On the other hand, these parameter settings are 
sometimes interactively altered during the procedure if the calculated 
numerical values do not yield the expected number of formants in the 
expected vowel-specific frequency ranges compared to the respective 
spectrogram. 


Thus, for example, with regard to sounds of a single speaker, LPC ana- 
lysis involving standard parameters according to the related speaker 
group (average vocal-tract size) may yield the expected values for only 
a part of the sounds, whereas the analysis of other sounds may require 
the parameters to be set to the standard of another speaker group (av- 
erage vocal-tract size) or to a setting that is entirely different from any 
speaker-group related standard given in the literature. 


This reveals an inconsistency in how parameter settings are estab- 
lished: in the first instance, default settings of analytical parameters are 
related to specific vocal-tract sizes, whereas any corrections of these 
settings are related to the respective general (not vocal tract related) 
degree of “formant resolution” of the analysis.


6.6 Addition: Spectrum, Formant Pattern, Resynthesis


As explained in Section 6.1, current methods of analysis yield no con- 
sistent and direct relationship between spectrum, spectral envelope 
and formant frequencies. Consequently, this raises the question of the 
existence of a general relationship between a natural vowel sound, the 
determined formant pattern and resynthesis. 


Currently, resynthesis is indeed being used to examine the reliability of 
calculated formant patterns. However, this kind of verification is unable 
to substantially relativise the general problems of the existing meth- 
ods of analysis described above: resynthesis is feasible only if formant 
analysis is not fundamentally at issue and only with regard to a limited 
variation of analytical parameters. 


Moreover, the question of resynthesis must be discussed against the 
background of synthesised sounds as discussed in Section 3.1, in- 
dicating the possibility of substantial differences in formant patterns 
of sounds of one vowel: if a certain analytically determined formant 
pattern used in a resynthesis reveals an “expected” vowel identity in 
a perceptual test, then this does not mean that another determined 
formant pattern, based on a different parameter setting, and applied in 
a second resynthesis, in principle cannot reveal the same vowel identity.
Further, the possibility cannot be excluded that there are cases of
sounds for which, with regard to the perceived vowel quality, a resynthesis
based on “unexpected” formant patterns produces a better approximation to
the quality of the natural sounds in question than one based on “expected”
formant patterns.


6.7 Addition: Formant Analysis and Objectivity with Regard 
to Synthesised Vowel Sounds 


It is noteworthy that, if a sound is synthesised using a specific pattern 
of filters and filter bandwidths, the formant pattern of a subsequent 
analysis may differ from the synthesis filters if the number of filters 
used is not communicated to the scholar conducting the analysis. 


Moreover, the problem of possible differences between the filters used in
synthesis and the formant patterns obtained in analysis will be substantially
aggravated if the fundamental frequency is varied independently of the filters.


6.8 Addition: Formant Patterns and Resynthesis outside
of the Framework of Prevailing Theory 


It is also noteworthy that, if formant patterns are calculated outside 
the framework of prevailing theory, for example, using LPC analysis as 
a method to decompose any sound into a source and a set of filters, 
irrespective of the fundamental frequency and the perceptual quality 
and not relating the decomposition to existing formant or resonance 
statistics (and therefore not considering a direct relationship between 
spectral peaks and resonances of the vocal tract), and if the results of 
analysis are used in resynthesis, for many examples of natural utter- 
ances, resynthesis reproduces similar intelligible vowel qualities, even 
for very high fundamental frequencies. Obviously, then, formant pat- 
terns will sometimes deviate strongly from the statistical patterns given 
in the literature. 


Part III Experiences and Observations


The third part of the main text formulates several hypotheses about 
the actual relationship between vowel sounds, sound spectra
and formant patterns. These hypotheses refer to the recordings
mentioned in the first part of the introduction and to related analyses 
and observations. 


7 Unsystematic Correspondence between
Vowels, Patterns of Relative Spectral 
Energy Maxima and Formant Patterns 


7.1 Inconstant Number of Vowel-Specific Relative Spectral 
Energy Maxima and Incongruence of Vowel-Specific 
Formant Patterns 


As discussed in Section 3.1, sounds of back vowels and of /a-ɑ/ can
exhibit only one relative spectral energy maximum within their vowel-specific
frequency range < 1.5 kHz (< 2 kHz for some sounds of /a/),
in contrast to other sounds of the same vowels, which have two such 
maxima. Consequently, the number of vowel-specific energy maxima 
is inconstant. 


The spectral envelopes and formant patterns of such vowel sounds 
cannot in all cases be interpreted as “formant merging”: examples of 
sound pairs of back vowels can be observed for which both sounds ex- 
hibit the lowest spectral envelope peak at a similar frequency level, but 
only one of them has a pronounced second envelope peak within the 
frequency range mentioned. Then, the first spectral envelope peak of 
both sounds corresponds to the vowel quality in question, whereas the 
second spectral envelope peak may be linked to an additional “colour- 
ing” of that sound. However, it plays a marginal role in vowel perception 
and, in such a case, does not possess vowel-differentiating value.


For both sounds of such sound pairs, formant analyses using current 
methods may reveal two lower formants. However, calculating F2 for 
the first sound of the respective sound pair mentioned, exhibiting only 
one lower spectral envelope peak, may prove highly contingent on the 
number of filters chosen, above all for sounds of children. In addition, 
its amplitude can be very low and its bandwidth can be very large, that 
is, far beyond reference values as given in the literature. 


With regard to front vowels, the frequency of observable second en- 
velope peaks, and with them also calculated F2, can vary strongly. 
Because of this, there are examples of sound pairs of front vowels 
for which the second envelope peak and calculated F2 of one sound 
approaches or even exceeds the third envelope peak and calculated 
F3 of the other sound. (Such observations in general relate to sounds 
of speakers of different speaker groups, which are produced at similar 
fundamental frequencies. However, this can also be observed for the 
sounds of speakers of the same speaker group.) 


Thus, it is not possible to designate a standard number of consecutive
relative spectral energy maxima related to delimited frequency ranges 
that represent any given vowel. The same holds true for formants, al- 
though it is less obvious. There are also formant patterns of sounds of 
single vowels whose reciprocal correspondence of single formants is 
open to discussion. 


The number of vowel-specific relative spectral energy maxima is in- 
constant, and formant patterns are incongruent in some cases. 


7.2 Partial Lack of Manifestation of Vowel-Specific Relative 
Spectral Energy Maxima 


In their vowel-specific range of the spectrum < 1.5 kHz, sounds of back 
vowels and of /a-ɑ/ produced at fundamental frequencies < 350 Hz can
exhibit series of harmonics with consistent, quasi-identical amplitudes. 
These vowel-specific parts of harmonic spectra seem to be “flat”, lack- 
ing any clearly distinctive relative energy maxima. Of special interest 
in this respect are the sounds of /a, ɑ, ɔ, o/ in cases where the amplitudes
of the first three to five harmonics are not markedly different.


In their vowel-specific range of the spectrum > 1.5 kHz, sounds of front
vowels produced at fundamental frequencies < 350 Hz can also exhibit
series of harmonics with consistent, quasi-identical amplitudes. Thus,
what applies to back vowels and to /a-ɑ/ for their entire vowel-specific
frequency range also applies to front vowels for the higher part of their 
vowel-specific frequency range. 


In addition, cases of such vowel-specific, “flat” spectral portions also
exist for sounds produced at fundamental frequencies > 350 Hz, even
if, in relation to the large frequency spacing of the harmonics, this generally
remains limited to the sounds of the vowels /i, e, ɛ, a, ɑ/. For
certain fundamental frequencies of the sounds of /ɔ, o/, the first two
harmonics can exhibit equal amplitudes.


Also worth mentioning in this context are the sounds of back vowels 
and of /a-ɑ/, which exhibit continuously decreasing amplitudes in the
vowel-specific lower frequency range. In the spectra of these sounds, 
the first harmonic generally forms the actual spectral maximum. 
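
The two spectral shapes described in this section, “flat” and continuously decreasing, can be stated operationally. The following is a minimal sketch, assuming that the levels of the harmonics within the vowel-specific range are available in dB. The 3 dB tolerance for “quasi-identical” amplitudes and the function name are hypothetical; the text itself does not fix a threshold.

```python
def classify_lower_harmonics(levels_db, flat_spread_db=3.0):
    """Label the levels (dB) of the harmonics in the vowel-specific range.

    flat_spread_db is a hypothetical tolerance for "quasi-identical" levels.
    """
    if max(levels_db) - min(levels_db) <= flat_spread_db:
        return "flat"            # no clearly distinctive relative maximum
    if all(b < a for a, b in zip(levels_db, levels_db[1:])):
        return "decreasing"      # the first harmonic is the spectral maximum
    return "other"               # e.g. a relative maximum above the first harmonic


# Illustrative level series (invented numbers, not measurements).
print(classify_lower_harmonics([60.0, 59.0, 61.0, 60.5]))
print(classify_lower_harmonics([60.0, 54.0, 47.0]))
print(classify_lower_harmonics([50.0, 62.0, 51.0]))
```

Only the third series contains a relative energy maximum in the sense of the literature; the first two correspond to the cases discussed above.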


Thus, the set of problems concerning a formulation of a general re- 
lationship between the perceived vowel quality and its physical rep- 
resentation based on a certain number of relative spectral energy max- 
ima is again extended. 


Spectral envelope maxima, as described in the literature, are not a
precondition for the physical representation of vowels. 


The relationship between “flat”, vowel-specific parts of sound spectra 
and calculated formant frequencies using current methods of analysis 
cannot be described in simple and general terms. The same holds true 
for the relationship between continuously decreasing amplitudes of the 
harmonics in the vowel-specific lower frequency range and calculat- 
ed formant patterns. Therefore, the issue is left open to discussion 
here. However, it has to be considered as an additional methodological 
problem of formant analysis. 


7.3 Addition: Resynthesis and Synthesis 


Inconstancy in the number of vowel-specific relative spectral energy 
maxima, possible incongruence of formant patterns and vowel sounds 
with “flat” or decreasing vowel-specific spectrum portions can be rep- 
licated using resynthesis. 


The same also applies to formant patterns or harmonic spectra not 
derived directly from natural vowel sounds. 


8 Lack of Correspondence between Vowels
and Patterns of Relative Spectral Energy 
Maxima or Formant Patterns 


8.1 Dependence of Vowel-Specific, Relative Spectral Energy 
Maxima and Lower Formants < 1.5 kHz on Fundamental
Frequency 


If investigated empirically and systematically, it becomes evident that 
the first spectral envelope peak—if it exists—and the first calculated 
formant of vowel sounds often depend on fundamental frequency. 


For a range of fundamental frequencies < 350 Hz for which formant ana- 
lysis is not critical in principle, this dependence is particularly evident 
in the sounds of the vowels /e, ø, o/ at fundamental frequencies in the
range of 200 Hz to 350 Hz. 


For a range of fundamental frequencies > 350 Hz, this dependence is, 
above all, indicated in sounds of the vowels /i, y, u/, because the first 
harmonic generally exhibits the highest amplitude; thus, the lowest 
spectral peak rises with increasing fundamental frequency. 


In addition, such a dependence can also be observed for the second 
formant for cases of sounds of back vowels. 


For sounds of /ɛ/ and of /a-ɑ/, however, indications of a dependence
of F1 on fundamental frequency may prove to be weak and corre- 
sponding observations may require a comparison of sounds with a 
very extended vocal range. 


Moreover, the observation of a dependence of F1 on fundamental fre- 
quency is not only related to frequency ranges of the latter and vowel 
qualities but also to single speakers and their phonation characteris- 
tics, including vocal effort. (Note that marked differences in the vocal 
effort of vowel production have a substantial effect on spectral peaks 
and calculated formant frequencies, and this effect has to be taken 
into account when investigating the relationship between F0, spectral
peaks and formants.) But although the indications for the dependence 
discussed here prove to be unsystematic, the findings of intelligible 
vowel sounds at fundamental frequencies > 500 Hz (see next chapter) 
and of formant pattern ambiguity (see Chapter 9) force us to relate 
the lower spectral peaks and the lower formants to fundamental fre- 
quency. 


The possible relationship between fundamental frequency and higher 
vowel-specific spectral envelope peaks or formants > 1.5 kHz for sounds 
of front vowels is left open here for discussion. 


These assertions hold true for vowel sounds produced by one and the 
same speaker. Thus, they apply to vowels and their physical representa- 
tion. 


In this respect, what is of particular importance is the observation that
the dependence of lower spectral envelope peaks and lower formants
< 1.5 kHz on fundamental frequency does not represent a phenomenon
generally related to “oversinging” the first formant of a vowel: most
importantly, the shifts of F1 in the sounds of the vowels /e, ø, o/ can
already be observed at fundamental frequencies substantially below the
corresponding statistical values for F1 as given in the literature for
sounds produced in citation-form words. Moreover, given a range of
fundamental frequencies of c. 200-350 Hz, the shifts of F1 for the sounds
of the vowels /e, ø, o/ are in many cases much more pronounced than for
the sounds of the vowels /i, y, u/, although, for the former, the
literature gives significantly higher statistical values for F1 than for
the latter.


Also of particular importance—and foreshadowing formant pattern am- 
biguity of vowel sounds (see Chapter 9)—is the observation that, in 
many cases of sounds of a vowel produced by a single speaker, the 
shifts of F1 in relation to fundamental frequency exceed the F1 dif- 
ferences of two neighbouring vowels as given in formant statistics 
for a corresponding speaker group (for speakers with corresponding 
vocal-tract size). In line with this, the shifts mentioned also exceed 
speaker-group differences in F1 for that same vowel as given in the 
formant statistics mentioned.


Vowel-specific relative spectral energy maxima < 1.5 kHz (if determinable)
and calculated vowel-specific formant patterns (if methodologically
substantiated) are dependent on fundamental frequency.


8.2 Vowel Perception at Fundamental Frequencies 
above Statistical Values of the First-Formant Frequency 


Speakers possessing a large vocal range and good phonation and articulation
are able to form the sounds of the vowels /i, y, e, ø, ɛ, a, o,
u/ in a recognisable and distinguishable way up to a fundamental frequency
of c. 700-800 Hz. Such sounds can be readily experienced up
to a fundamental frequency of c. 600 Hz because they occur frequently
in everyday speech, in particular among children and women. However,
these sounds can also be evidenced for men using “falsetto”.


Speakers possessing excellent vocal abilities are even able to form 
the sounds of the corner vowels /i, a, u/ in a clearly recognisable and 
distinguishable way up to a fundamental frequency of c. 800-1000 Hz. 
(Ongoing research also indicates that other vowels, too, are intelligible 
in this vocal range.) 


Correspondingly, the respective sound spectra exhibit vowel-specific 
differences, even if these have to be described other than in terms 
of spectral envelopes and formant patterns, for example in terms of 
vowel-specific configurations in the levels of the harmonics (see below, 
Sections 13.2 and 13.3). 


Note that a fundamental frequency of 700 Hz lies above the statistical 
F1 values given for sounds of all long German vowels produced by 
women or men, except for /a/ of women. A fundamental frequency of 
800-1000 Hz even lies above the statistical F1 values for all long Ger- 
man vowels, for both women and men (see Section 2.2). 


The vowel quality of sounds produced at fundamental frequencies 
above statistical values of the vowel-related first-formant frequency 
is intelligible in principle. 


The possibility of such vowel production and perception contradicts the 
designation of established, statistically determined formant patterns 
as “vowel-specific” patterns, irrespective of the methodological prob- 
lems of determining envelope peaks and formant frequencies. At the 
same time, vowel perception and discrimination at such high funda- 
mental frequencies confirms that lower spectral energy maxima (if de- 
terminable) and lower formants (if methodically substantiated) depend 
on fundamental frequency. 


The vowel quality of sounds of back vowels and of /a-ɑ/ produced
at fundamental frequencies > 500 Hz can be physically represented
solely in terms of the first two or three harmonics and their amplitudes. 
This accentuates the basic problem of assuming that relative spectral 
energy maxima, that is, envelope peaks in closely delimited frequency 
ranges, are a pervasive physical characteristic of the sound of a vowel. 
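
The arithmetic behind this remark is simple: the harmonics lie at integer multiples of the fundamental frequency, so only few of them fall into the lower vowel-specific range when the fundamental frequency is high. The following is a minimal sketch; the function name is hypothetical, and the strict limit of 1.5 kHz is taken from the frequency range named in the text.

```python
def harmonics_below(f0_hz, limit_hz=1500.0):
    """Harmonic frequencies k * F0 (k = 1, 2, ...) strictly below limit_hz."""
    harmonics = []
    k = 1
    while k * f0_hz < limit_hz:
        harmonics.append(k * f0_hz)
        k += 1
    return harmonics


for f0 in (150.0, 350.0, 500.0, 700.0):
    print(f0, harmonics_below(f0))
```

At a fundamental frequency of 150 Hz, nine harmonics lie below 1.5 kHz. At 500 Hz, only the harmonics at 500 Hz and 1000 Hz remain below the limit, with the third harmonic lying exactly on it, and at 700 Hz only those at 700 Hz and 1400 Hz remain.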


Here, the question of the maximal fundamental frequency up to which 
all vowels of any given language can in principle be produced in a rec- 
ognisable way is left open for discussion. 


8.3 “Inversions” of Relative Spectral Energy Maxima
and Minima and “Inverse” Formant Patterns in Sounds 
of Individual Vowels 


Given that spectral envelope peaks < 1.5 kHz (if determinable) depend 
on fundamental frequency, pairs of sounds of a back vowel produced 
at different fundamental frequencies can exhibit “inverse” relative spec- 
tral maxima and minima in the form of “inverse” spectral envelope 
curves < 1.5 kHz without any change in vowel perception: whereas we
see a relative minimum in the spectrum for one sound, we may observe 
a spectral maximum for the other, and vice versa. The same holds true 
for comparisons between the respective calculated filter curves and 
formant patterns (if methodologically substantiated): where for one 
sound, the filter curve exhibits a relative minimum, for another sound, 
the curve may exhibit a maximum, and vice versa. 
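
An “inversion” in this sense can be checked mechanically once two envelope curves are sampled on the same frequency grid. The following is a minimal sketch under that assumption. The function names and the level values are hypothetical, and requiring the extrema to fall on exactly the same grid point is a simplification.

```python
def local_extrema(levels):
    """Return (maxima, minima) as sets of interior grid indices."""
    maxima, minima = set(), set()
    for i in range(1, len(levels) - 1):
        if levels[i - 1] < levels[i] > levels[i + 1]:
            maxima.add(i)
        elif levels[i - 1] > levels[i] < levels[i + 1]:
            minima.add(i)
    return maxima, minima


def inversions(env_a, env_b):
    """Grid indices where one curve has a relative maximum and the other a
    relative minimum."""
    max_a, min_a = local_extrema(env_a)
    max_b, min_b = local_extrema(env_b)
    return sorted((max_a & min_b) | (min_a & max_b))


# Invented envelope levels (dB) for two sounds on a common frequency grid.
env_a = [50.0, 60.0, 52.0, 45.0, 50.0, 40.0]
env_b = [55.0, 48.0, 53.0, 58.0, 50.0, 42.0]
print(inversions(env_a, env_b))
```

For these invented curves the check reports the grid points 1 and 3: the first curve has a maximum where the second has a minimum, and vice versa.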


In the case of some front vowels, such “inversions” can also be ob- 
served for the higher vowel-specific frequency range, even if the ques- 
tion of the relationship between such “inversions” and fundamental 
frequency variation is left open here. 


This observation reaffirms the lack of a general correspondence be- 
tween vowels, vowel-specific spectral envelope curves and corre- 
sponding formant patterns. 


With regard to vowel-specific frequency ranges, the spectral enve- 
lope curves of two sounds of the same vowel produced at two dif- 
ferent fundamental frequencies can exhibit “inverse” behaviour. The 
same holds true for formant patterns. 


8.4 Addition: Whispered Vowel Sounds, Fundamental-Frequency 
Dependence of Vowel-Specific Spectral Characteristics 
and “Inversions” 


As discussed in Section 5.5, formant statistics indicate increased vow- 
el-specific formant frequencies F1 and F2 for whispered sounds when 
compared to voiced sounds. However, according to the corresponding 
recording procedures of the comparative investigations, this only ap- 
plies to the lower range of fundamental frequency of the voiced sounds 
produced in citation-form words, comparable to relaxed speech in an 
enclosed space. 


Given that a whispered sound exhibits higher first and second formants 
than a voiced sound of the same vowel and given that the latter’s fundamental
frequency is gradually increased during its production, then
in many cases it is possible to determine a certain fundamental fre- 
quency for which F1 and F2 of the whispered and voiced sound corre- 
spond with each other. 


Whether this represents an actual rule is left open here. 


If the fundamental frequency of a voiced sound is increased further, 
then there will be cases in which F1 or F1-F2 of the whispered sound 
are lower than F1 or F1-F2 of the voiced sound. 


In any event, the general statement that whispered sounds exhibit fun- 
damentally higher vowel-specific formant patterns than voiced sounds 
does not apply. 


Over the course of such experimentation, cases involving comparisons 
between whispered and voiced sounds exhibiting the described “in- 
versions” may also be found. 


8.5 Addition: Resynthesis and Synthesis 


All the above aspects of the lack of correspondence between vowels 
and patterns of relative spectral energy maxima or formant patterns, 
discussed in relation to natural vowel sounds, can be evaluated and 
replicated using resynthesis. 


The same holds true for resynthesis at fundamental frequencies > 350 Hz 
related directly to the harmonic spectra of natural vowel sounds. 


The same also applies to synthesis involving formant patterns or har- 
monic spectra not derived directly from natural vowel sounds. 


9 Ambiguous Correspondence between
Vowels and Patterns of Relative Spectral 
Energy Maxima or Formant Patterns 
or Complete Spectral Envelopes 


9.1 Ambiguous Patterns of Relative Spectral Energy Maxima 
and Ambiguous Formant Patterns 


All these reflections and observations come down to the conjecture 
that two sounds of two different vowels, produced at two different 
fundamental frequencies, can exhibit quasi-identical relative spectral 
energy maxima and quasi-identical formant patterns within their vow- 
el-specific frequency range. Indeed, many patterns of spectral enve- 
lope peaks and formants prove to be ambiguous empirically. As such, 
they often physically represent two (or even several) different vowels. 
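
What “quasi-identical” means in a comparison of patterns has to be fixed before ambiguity can be counted. The following is a minimal sketch, assuming a relative tolerance per formant; the 5 % tolerance, the record layout, the function names and the frequency values are all hypothetical, and the vowel labels are deliberately abstract.

```python
def quasi_identical(pattern_a, pattern_b, rel_tol=0.05):
    """True if corresponding formant frequencies differ by at most rel_tol."""
    return len(pattern_a) == len(pattern_b) and all(
        abs(fa - fb) <= rel_tol * max(fa, fb)
        for fa, fb in zip(pattern_a, pattern_b))


def ambiguous_pairs(sounds, rel_tol=0.05):
    """Pairs of sounds of different vowels with quasi-identical patterns."""
    pairs = []
    for i, a in enumerate(sounds):
        for b in sounds[i + 1:]:
            if a["vowel"] != b["vowel"] and quasi_identical(
                    a["formants"], b["formants"], rel_tol):
                pairs.append((a["vowel"], b["vowel"]))
    return pairs


# Invented records: vowel label, F0 in Hz, (F1, F2) in Hz.
sounds = [
    {"vowel": "V1", "f0": 200.0, "formants": (400.0, 800.0)},
    {"vowel": "V2", "f0": 400.0, "formants": (410.0, 790.0)},
    {"vowel": "V1", "f0": 400.0, "formants": (600.0, 1000.0)},
]
print(ambiguous_pairs(sounds))
```

In this invented set, the first two records form an ambiguous pair: two different vowels at two different fundamental frequencies with patterns that agree within the tolerance. Any actual count of such pairs depends on the tolerance chosen.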


In many cases the patterns of relative spectral energy maxima do 
not prove to be vowel specific, but ambiguous. The same holds true 
for formant patterns. 


This observation becomes particularly evident when comparing vow- 
el sounds for their entire range of fundamental frequencies for which 
vowels are recognisable and distinguishable and when evaluating the 
correspondences between relative spectral energy maxima and mini- 
ma also in a direct comparison of harmonic spectra, aside from deter- 
mining spectral envelopes and formant frequencies. 


9.2 Ambiguous Spectral Envelopes 


In certain cases, this ambiguity also concerns the entire course of the 
spectral envelope. 


Spectral envelopes can be equally ambiguous. 


9.3 Ambiguity and Individual Vowels 


For all German vowels discussed here, there are cases of sounds with 
ambiguous patterns of relative spectral energy maxima or with ambig- 
uous formant patterns within the respective vowel-specific frequency 
ranges. 


To what extent this is also true for complete spectral envelopes is left
open for discussion. 


If vowel sounds are compared for their entire range of fundamental 
frequencies for which vowels are recognisable and distinguishable and 
if a possible correspondence of relative spectral energy maxima and 
minima is evaluated in a direct comparison of harmonic spectra, then
the above ambiguity can be observed not only for sounds of neigh- 
bouring vowel pairs but also for other sound pairs and sometimes for 
sounds of more than two different vowels. This holds particularly true 
when comparing sounds produced by all of the three age- and gen- 
der-related speaker groups. 


The ambiguity described is not limited to only a part of the vowels or 
to neighbouring vowel pairs, and it can affect more than two vowels 
simultaneously. 


The question of whether there are sounds of certain vowels that ex- 
hibit strict vowel-specific patterns of relative spectral energy maxima 
and strict vowel-specific formant patterns, which cannot be found in 
sounds of any other vowel—for example for sounds of /a/—is left open 
for further discussion. 


9.4 Addition: Resynthesis and Synthesis 


The ambiguity discussed above in relation to natural vowel sounds can 
be evaluated and replicated using resynthesis. 


The same also applies to synthesis involving formant patterns or har- 
monic spectra not derived directly from natural vowel sounds. 


10 Lack of Correspondence between
Patterns of Relative Spectral Energy 
Maxima or Formant Patterns and Speaker 
Groups or Vocal-Tract Sizes 


10.1 Similar Patterns of Relative Spectral Maxima and Similar 
Formant Patterns < 1.5 kHz for Different Speaker Groups
or Different Vocal-Tract Sizes 


If sounds of a vowel are produced at equal fundamental frequencies
by children, women and men, and if these sounds perceptually correspond
with each other not only in terms of their general attribution to
a vowel quality but also in terms of the respective “vowel-colour” variant
(which makes for the greatest possible correspondence as regards
perception), then, empirically, both the relative spectral energy maxima
(if determinable) and the formant patterns (if methodically substantiated)
often prove to be similar in the lower frequency range < 1.5 kHz,
apart from possible differences due to the different parameter settings
involved in formant analysis. Expected age- and gender-related spectral
differences decrease or disappear if the fundamental frequencies of
the utterances correspond for children, women and men.


Further, for sounds of back vowels and sounds produced by men at
higher fundamental frequencies than women, it follows that the sounds
of men (at higher F0) may exhibit higher relative spectral energy maxima
(if determinable) and higher F1 or even F1-F2 patterns (if methodically
substantiated) than the sounds of women (at lower F0), as holds
true for F1 of front vowels. The same may also occur in a corresponding
comparison of sounds of adults and children.


No statements are made here on /a-ɑ/ since our observations do not yet
allow for general formulations for all sounds of /a-ɑ/ (see Section 8.1).


Thus, the question arises whether the lower range of the vowel spec- 
trum mentioned is indeed dependent on age- and gender-related speak- 
er groups, that is, on vocal-tract size. In the literature, this lower fre- 
quency range is referred to as being entirely vowel specific for all back 
vowels and, concerning F1, vowel specific for all other vowels. 


In any event, the general statement that the sounds produced by chil- 
dren exhibit the highest, the sounds of women intermediate and the 
sounds of men the lowest patterns of vowel-specific relative spectral 
energy maxima and formant frequencies does not apply. 


Within the frequency range < 1.5 kHz, vowel-specific patterns of
relative spectral energy maxima (if determinable) and formant pat- 
terns (if methodically substantiated) often prove to be empirically in- 
dependent of the age- and gender-related speaker group, that is, 
the vocal-tract size. Given strict perceptual correspondences, then, 
differences refer directly to the differences in fundamental frequency. 


As mentioned, the possible relationship between fundamental frequen- 
cies and higher vowel-specific spectral envelope peaks or formants 
for sounds of front vowels is left open for discussion. In the present 
context, this also concerns the question of whether or not higher frequency
ranges are in principle specific to vocal-tract sizes.


10.2 The Dichotomy of the Vowel Spectrum 


As mentioned repeatedly, while the dependence of vowel-specific 
spectral characteristics and formants on fundamental frequency for 
the lower frequency range < 1.5 kHz is easily understandable and reproducible
empirically, this is not the case for the higher frequency
range. At the same time, lower spectral ranges and lower formant fre- 
quencies are not generally specific to speaker groups and vocal-tract 
sizes. Whether this is also the case for higher spectral ranges and 
formant frequencies is still in question. Thus, the spectrum of a vowel 
sound needs a twofold rather than a uniform consideration. 


The spectrum of a vowel proves to be dichotomous. 


In this context, with regard to the sounds of front vowels, it is particu- 
larly important to consider that, in certain cases, higher relative spectral 
energy maxima (if determinable) and higher formants (if methodically 
substantiated) > 2 kHz may be simultaneously related to vowel identity
and perceived speaker group: differences in this higher frequency
range can often be observed for sounds of a front vowel produced by 
children, women and men if the speakers form these sounds at similar 
fundamental frequencies, even if there is no such difference found in 
the lower frequency range. 


However, it is left open for further investigation whether this is also 
the case if men imitate so-called “female voices” or if adults imitate 
“children’s voices”. 


10.3 Addition: Whispered Vowel Sounds and Speaker Groups or 
Vocal-Tract Sizes 


No results of comparative studies of formant patterns for whispered 
vowel sounds of children, women and men have been published to 
date that have obtained a reference status as is the case for reference 
statistics of voiced vowel sounds referred to in Part II. However, the
studies that compare whispered sounds of different speaker groups 
(limited in number and generally not including all vowels of a language) 
refer to corresponding differences between formant patterns. 


Notwithstanding the reflections and comments made so far, these dif- 
ferences can be understood as an indication of a general relationship 
between patterns of relative spectral energy maxima and formant pat- 
terns on the one hand, and speaker groups, that is, average vocal-tract 
sizes on the other, including the lower frequency ranges. 


This aspect and its significance regarding the relationship between
vowels and related spectral characteristics are left open to discussion
here and need to be clarified and discussed elsewhere.


10.4 Addition: Vowel Imitations by Birds 


Sounds of animals imitating utterances of humans are also of primary im- 
portance in the discussion of vowel sounds, related spectral character- 
istics, formant patterns, perceived speaker groups and vocal-tract sizes. 


Fundamental in this respect is the question of how birds are able to 
imitate human sounds despite lacking the means of phonation and 
articulation—in particular, a corresponding vocal-tract resonance. 


According to our own preliminary examination of vowel imitation by 
common hill myna birds who excel at such mimicry (results unpub- 
lished, although some clear examples are given in the Materials sec- 
tion), we conclude the following: if these birds imitate words, and if 
individual imitated vowel sounds are isolated as sound fragments in a 
way that they possess a quasi-static character in terms of quasi-static 
spectral characteristics (above all, that transitions are excluded), then 
vowel perception and a distinction of such sounds by humans is pos- 
sible. For part of these sound fragments, complete F1-F2-F3 formant 
patterns comparable to patterns given for human sounds can be inter- 
preted. For the remaining fragments, only a partial correspondence in 
formant patterns can be observed. (However, this statement must be 
relativised: strictly speaking, any calculation of vowel-related formant 
patterns of bird sounds is methodically unsubstantiated; see below.) 


The fact that birds are able to imitate human vowel sounds with vowel-specific
spectral characteristics and formant patterns comparable
to those of humans contradicts, in its turn, a strict correspondence 
between the spectral characteristics of the produced sound and vo- 
cal-tract resonance. The same holds true for a strict correspondence 
between spectral characteristics of the produced sound and vocal- 
tract size. Consequently, any critical investigation and discussion of 
vowels must focus on the possibility that the same sound characteris- 
tics can be produced under substantially different physical and physi- 
ological conditions. 


Besides, if birds are able to mimic human utterances, they must be 
able to perceptually differentiate different vocal sounds. However, their 
perception cannot rely on any sensorimotor and conceptual experience
of vowel production comparable to the experience of humans.
Thus, it can be speculated that their perception relies on a more “ab- 
stract” acoustic “form” of the vowel sound. (Such speculation would 
meet the claim that a phenomenological approach to the physical rep- 
resentation of vowels is needed; see Part V.) 


10.5 Addition: Resynthesis and Synthesis 


Again, the lack of a general correspondence between patterns of rel- 
ative spectral energy maxima or formant patterns and speaker groups 
or vocal-tract sizes can be evaluated and replicated using resynthesis 
and synthesis. 


11 Lack of Correlation between Methodological
Limitations of Formant Determination
and Limitations of Vowel Perception


11.1 Vowel Perception at Fundamental Frequencies >350 Hz 


As discussed in Section 8.2, recognisable vowels can be produced 
at fundamental frequencies substantially exceeding the critical limit 
above which formants can no longer be reliably determined for method- 
ological reasons. 


Vowel perception is maintained for sounds at fundamental frequen- 
cies > 350 Hz. Yet, for these middle and higher fundamental fre- 
quency ranges, formant pattern estimation is questionable for 
methodological reasons. Thus, the methodological limitation of de- 
termining formant patterns of vowel sounds at fundamental frequen- 
cies > 350 Hz does not coincide with impaired vowel intelligibility. 


Consequently, formulating a general theory of the physical representa- 
tion of vowels based on formant patterns proves to be critical due to 
the related methodological limitations. 


11.2 Lack of Correspondence between Methodological Problems 
of Formant Pattern Estimation at Fundamental Frequen- 
cies <350 Hz and Impaired Vowel Perception 


Vowel sounds produced at fundamental frequencies < 350 Hz, for which 
the estimation of formant patterns proves questionable for reasons 
other than fundamental frequency—for instance, if expected relative 
spectral energy maxima are “missing” or if vowel-related parts of a 
spectrum are “flat” —are not less recognisable than vowel sounds for 
which formant pattern estimation may be said to be unproblematic. 


Methodological problems regarding the determination of formant 
patterns of vowel sounds at fundamental frequencies < 350 Hz do 
not necessarily coincide with impaired vowel intelligibility. 


11.3 Addition: Lack of Methodological Basis of Determining
Formant Patterns for Vowel Mimicry by Birds 


Given the prevailing methodological standards, strictly speaking, the 
imitation of human vowel sounds by birds cannot be studied in terms 
of formant patterns. As explained in Section 6.3, formant calculation 
requires parameter settings for the frequency range and the maximum 
number of filters used in the analysis in relation to a specific vocal-tract 
size. Birds, however, have no vocal tract comparable to that of hu- 
mans. Hence, it is impossible to determine how many filters should be 
used in analysing a vowel-like sound produced by a bird to determine 
vowel-specific formants. 
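

Why the analysis settings presuppose a vocal-tract size can be made
concrete with a textbook idealisation that is not taken from this
treatise: a uniform tube closed at one end, whose resonances lie at odd
multiples of c/(4L). The sketch below uses this idealisation only to
show that the number of resonances expected below a given frequency
ceiling depends on the assumed tube length. The function names, the
speed of sound of 350 m/s, the 5 kHz ceiling and the example lengths
are all assumptions made for illustration.

```python
SPEED_OF_SOUND_M_S = 350.0  # assumed value, not taken from the text


def tube_resonances(length_m, ceiling_hz):
    """Resonances of an idealised uniform tube (closed at one end) up to ceiling_hz."""
    resonances = []
    n = 1
    while True:
        freq = (2 * n - 1) * SPEED_OF_SOUND_M_S / (4.0 * length_m)
        if freq > ceiling_hz:
            return resonances
        resonances.append(freq)
        n += 1


def expected_formant_count(length_m, ceiling_hz):
    """Number of idealised resonances below the ceiling for a given tube length."""
    return len(tube_resonances(length_m, ceiling_hz))


# Hypothetical tube lengths in centimetres.
for length_cm in (17.5, 14.0, 10.0):
    n = expected_formant_count(length_cm / 100.0, ceiling_hz=5000.0)
    print(f"tube length {length_cm:4.1f} cm: {n} resonances below 5 kHz")
```

In this idealisation, the count cannot be computed without a value for
`length_m`, which is the situation described above for vowel-like
sounds produced by birds.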


Thus, in a first step, comparisons between the utterances of humans 
and birds must be based on a direct comparison of the respective 
spectra and must relate to the interpretation of observable relative 
spectral energy maxima. However, in a subsequent step, formant ana- 
lysis double-checked by resynthesis may be applied even if methodi- 
cally unsubstantiated, in order to foster the discussion. 


Again, this methodological limitation of mimicry analysis does not
coincide with any fundamental difficulty in identifying the imitated
vowel sounds involved.


Part IV Falsification


The fourth part of the main text explains why the reflections, 
experiences and observations compiled here falsify prevailing theory. 


12 Empirical Falsification despite Methodological
Limitations of Determining Patterns
of Relative Spectral Envelope Maxima 
or Formant Patterns 


12.1 Lack of Methodological Basis for Verifying Prevailing Theory 


Concerning isolated vowel sounds exhibiting quasi-static spectral char- 
acteristics and allowing for clear perceptual vowel recognition and dis- 
tinction, it is not possible, in a particular language, to formulate general 
rules for determining patterns of relative spectral energy maxima or of 
formant patterns which consistently correspond to the perceived vow- 
el quality of the sounds. 


Consequently, it is not possible to gather general statistical data on 
vowel-specific formant frequencies of recognisable vowel sounds re- 
ferring to the entire realm of utterances. 


Prevailing theory cannot be verified for methodological reasons. 


From a methodological perspective, prevailing theory, thus, is not en- 
dowed with adequate analytical instruments for capturing and describ- 
ing the phenomenon of the vowel. 


Existing references regarding formant statistics do not disclose this 
problem for two reasons: firstly, the investigated speakers are gener- 
ally not subject to a qualitative selection regarding their vocal abilities; 
secondly, such statistics generally exclude any systematic and exten- 
sive variation of fundamental frequency. Both factors, however, are es- 
sential prerequisites for studying the possible fundamental frequency 
ranges of intelligible vowel sounds and for examining the appropriate- 
ness of the methods of acoustic analysis with regard to the entire realm 
of utterances. (Moreover, qualitative speaker selection also allows for 
the study of other important aspects of vowel-sound variation, above 
all variation of vocal effort, register and phonation type.) 


12.2 Systematic Divergence of Empirical Findings
from Predictions of Prevailing Theory 


If lower relative spectral energy maxima can be determined and if the
corresponding formant frequency calculation can be methodically
substantiated, then in most cases the resulting patterns < 1.5 kHz,
that is, in the lower frequency range of the spectra, prove to depend
on the fundamental frequency of the sounds relative to the recognised
vowel.


Speakers of a given speech community, despite having different vo- 
cal-tract sizes and thus belonging to different speaker groups, are nev- 
ertheless able to produce the sounds of one and the same vowel at 
quasi-identical fundamental frequencies and with quasi-identical lower 
formant frequencies < 1.5 kHz. Moreover, speakers with comparatively
larger vocal-tract sizes can produce sounds of some vowels at higher
fundamental frequencies and with higher F1 or even higher F1-F2
values than speakers with comparatively smaller vocal-tract sizes.


These empirical findings are reciprocally related. They diverge
systematically from both the predicted independence of vowel-specific
formant patterns from fundamental frequency and the predicted
pervasive dependence of vowel-specific formant patterns on speaker
group or vocal-tract size.


Empirical findings diverge systematically from the predictions of pre- 
vailing theory. 


From an empirical perspective, prevailing theory thus proves to be in- 
adequate. 


12.3 Empirical Findings Directly Contradicting Prevailing Theory 


A single speaker may not only occasionally produce different isolat- 
ed sounds of different vowels exhibiting the same formant patterns 
F1-F2 or F1-F2-F3 but, for some vowel qualities, this formant pattern 
ambiguity of vowel sounds in relation to the perceived vowel quality is 
systematic if the entire range of fundamental frequency of intelligible 
vowel sounds is investigated. In these cases of ambiguity, speakers 
cannot substantially vary fundamental frequency, maintain vowel qual- 
ity and also maintain formant patterns: if the speaker maintains the 
vowel quality, the formant pattern will alter, or if the formant pattern is 
kept constant, the vowel quality will change. 


This observation also holds true for patterns of spectral energy
maxima. Moreover, in some cases, as mentioned, even the entire
interpretable spectral envelope proves to be ambiguous.


Empirical findings can directly contradict the predictions of prevail- 
ing theory. 


Consequently, prevailing theory is falsified because, for a substantial 
portion of vowel sounds, the opposite of what the theory claims to be 
true actually applies: in many cases, given a variation of fundamental 
frequency, vowel sounds with very different formant patterns allow for 
a perception of the same vowel quality, while vowel sounds with similar 
formant patterns allow for a perception of different vowel qualities. 
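

The two kinds of counter-evidence named above can be expressed as a
simple search over a collection of analysed sounds. The following
sketch is only an illustration of that logic. The record layout, the
function names and the 50 Hz tolerance are assumptions, and the three
records are invented placeholders, not measurements from this treatise
or from any other source.

```python
from itertools import combinations

# Invented placeholder records: (vowel label, f0, F1, F2), all in Hz. NOT measurements.
SOUNDS = [
    ("o", 200.0, 400.0, 800.0),
    ("o", 500.0, 520.0, 1000.0),
    ("u", 500.0, 410.0, 790.0),
]


def similar(a, b, tolerance_hz=50.0):
    """True if F1 and F2 of two sounds both differ by less than tolerance_hz."""
    return abs(a[2] - b[2]) < tolerance_hz and abs(a[3] - b[3]) < tolerance_hz


def contradictions(sounds, tolerance_hz=50.0):
    """Collect the sound pairs that run against a fixed vowel-to-pattern mapping."""
    similar_pattern_different_vowel = []
    same_vowel_different_pattern = []
    for a, b in combinations(sounds, 2):
        alike = similar(a, b, tolerance_hz)
        if alike and a[0] != b[0]:
            similar_pattern_different_vowel.append((a, b))
        elif not alike and a[0] == b[0]:
            same_vowel_different_pattern.append((a, b))
    return similar_pattern_different_vowel, same_vowel_different_pattern


ambiguous, divergent = contradictions(SOUNDS)
print("similar pattern, different vowel:", ambiguous)
print("same vowel, different pattern:", divergent)
```

Applied to real data, such a search would only be meaningful if the
fundamental frequency of the sounds is varied systematically and if
vowel identity is established in listening tests.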


Part V Commentary


The fifth part of the main text discusses the resulting state of affairs 
and points to the need to devise a phenomenology and to develop 
a new theory. 


13 Preliminaries 


13.1 Impediments to Adjusting Prevailing Theory 


In response to the principal difficulties in intellectually re-enacting the 
prevailing theory of the acoustics of the vowel and in response to the 
empirical observations discussed in the previous chapters, there are 
several arguments against adjusting or modifying prevailing theory and 
the corresponding methods of acoustic analysis. 


According to prevailing theory, formant patterns are deduced from 
patterns of vocal-tract resonances. The formulation of a substantial 
interrelation between these resonances and fundamental frequency in 
the production of vowel sounds would directly contradict the two-part 
model of source and filter and the corresponding understanding of 
phonation and articulation, namely, the production of a general source 
sound and its transformation by vocal-tract resonances. Fundamental 
frequency is a primary characteristic of the source, and resonances are 
a primary characteristic of the vocal tract. These resonances are inde- 
pendent of the sounds or noises affecting them. (Interactions of source 
and filter, as described in the literature, do not relate to the aspects 
discussed here.) This amounts to a fundamental conceptual obstacle 
when it comes to differentiating or modifying prevailing theory. 


Current methods of formant analysis neglect fundamental frequency as 
a source characteristic in the calculation of filters. There is little scope 
for changing this approach within the existing procedural framework. 


Besides, even if formants are not considered to be directly linked to
vocal-tract resonances and are instead interpreted solely as the
result of an analytical decomposition of a sound into a source and a
set of filters, it proves difficult to imagine a corresponding method
of acoustic analysis applicable to all recognisable sounds and all of
the aspects discussed in Part III. This lack of projection itself
impedes the modification of prevailing methodology.


The observable behaviour of vowel-specific patterns of relative spec- 
tral energy maxima (if determinable) and of formants (if methodically 
substantiated) cannot be formulated in terms of a general rule, such 
as relating these characteristics to fundamental frequency as a simple 
ratio, whether or not such a ratio is based on an auditory scale. Empir- 
ically, these characteristics prove to be unsystematic: in general, the 
shifts in the spectral envelope peaks and the formants discussed are 
distinctly evident only at fundamental frequencies above c. 200 Hz; the
shifts affect the lower spectral frequency ranges and the higher ranges 
differently; thus, the shifts affect the entire vowel-specific frequency 
range of back vowels in a direct way but only affect the vowel-specific 
frequency range of front vowels partly; the shifts relate to vowel quality, 
yet in parallel, they also relate to the frequency levels of the spectral 
envelope peaks or formants in question; in addition, a strong variation 
in vocal effort also affects the frequency location of the spectral enve- 
lope peaks and the calculated formants. 


Because of this lack of systematic empirical evidence and because 
there is no uniform method for analysing vowel-specific acoustic char- 
acteristics, including all utterances allowing for vowel perception, no 
robust basis exists for a further differentiation of the description
of vowel-specific spectral characteristics within the prevailing
approach of relating them to patterns of spectral peaks or patterns of
formants.


These reflections, experiences and observations underlie the
scepticism expressed in this treatise about attempting to adjust or
modify prevailing theory and related methods of further analysis.


13.2 Prevailing Theory as an Index 


Given that a voiced vowel sound is produced in isolation and that it 
exhibits a quasi-constant periodic spectral characteristic, and given 
its unambiguous perception as belonging to a specific vowel quality 
(related to a particular language), then, its average harmonic spectrum, 
measured for the entire duration of the respective sound, is said to 
be vowel specific: for a frequency range concerning the physical rep- 
resentation of all vowels of the corresponding language, a series of 
harmonics quasi-identical in number, frequencies and levels, can only 
be found for other sounds of the same vowel but not for other sounds 
of any other vowel. Such a statement is formulated in terms of a hy- 
pothesis here. 


The same holds true for corresponding sounds that are isolated from 
a particular syntactic and semantic context and that are analysed ac- 
cordingly as sound fragments. 


Obviously, a direct comparison of harmonic spectra always relates to 
sounds at quasi-identical fundamental frequencies. 


Harmonic spectra, as claimed here, are vowel-specific and, further, may 
also prove to be orthogonal in vowel representation: on their basis, the 
respective sounds are expected to be reproducible without any change 
in the perceived vowel quality. 


Hence, the fundamental aspects of the problems discussed in the
previous parts of this treatise can neither be attributed to dynamic
processes occurring within a sound nor to the particular characteristics of
the syntactic and semantic context, nor indeed to special perceptual 
processes. Nor can these aspects be relativised accordingly. On the 
contrary, they constitute an ensemble of individual problems that first 
needs to be explained, just as the physical representation of the vowel 
itself, as a phenomenon, needs to be clarified. 


Given that voiced vowel sounds are compared at similar fundamental 
frequencies, and given that the spectral envelope is determined by 
the amplitude values of the harmonics, obviously, such an envelope 
is also vowel specific. However, concerning spectral envelope peaks, 
no simple statement can be derived if all fundamental frequencies of 
intelligible vowel sounds are considered. 


Given that voiced vowel sounds are compared at similar fundamen- 
tal frequencies, and given a methodological substantiation, it can be 
expected that calculated formant patterns (including formant band- 
widths) may, in most cases, also prove to be vowel specific and that, 
on their basis and not altering fundamental frequency, the respective 
sounds can be reproduced without substantial change in the perceived 
vowel quality. 


Thus, prevailing theory “hints” or “points” at the basic characteristic 
of the physical representation of vowel quality in an indexical manner. 
Prevailing theory proves to be an index of this representation. 


13.3 Excursus: Vowel Quality and Harmonic Spectrum 


To repeat: given that a voiced vowel sound is produced in isolation and 
that it exhibits a quasi-constant periodic spectral characteristic, and 
given its unambiguous perception as belonging to a specific vowel 
quality, then its average harmonic spectrum, measured for the entire 
duration of the respective sound, is said to be vowel specific. For a fre- 
quency range concerning the physical representation of all vowels of a 
language, a series of harmonics quasi-identical in number, frequencies 
and levels can only be found for other sounds of the same vowel but 
not for other sounds of any other vowel. 


At first glance, such a statement seems trivial. But it is not. 


To say that a harmonic spectrum of a vowel sound is specific for the 
perceived vowel quality—given the above conditions for the sounds 
under investigation—is not to say that all sounds of a vowel have very 
similar spectra of this kind. As shown, large spectral variations can be
found for the sounds of one vowel, particularly if vocal effort is varied 
during the sound production, if sounds of different speaker groups are 
compared and if different speaking and singing modes and styles, in- 
cluding stage voices, are also considered. 


Therefore, an attempt to directly assess the spectral difference relat- 
ed to a perceptual difference of two vowels simply by calculating an 
average harmonic spectrum for all sounds of one vowel at a given fun- 
damental frequency and comparing it with the similarly averaged har- 
monic spectrum of the other vowel may, in many cases, not result in a 
clear spectral difference, that is, in a frequency limit from which the two 
averaged spectra begin to diverge with no overlap. Exceptions may 
occur at high fundamental frequencies because the perceived vowel 
quality is represented by a greatly reduced number of harmonics. 


Considering both the direct relation between harmonic spectrum and 
perceived vowel quality on the one hand and the observably large vari- 
ation of harmonic spectra for sounds of single vowels on the other, and 
speculating that instead of looking at a static spectral configuration we 
should consider looking at a kind of spectral foreground-background 
relation, another attempt may provide more evidence. 


If the harmonic spectrum of a reference sound of a vowel is compared 
with both the spectra of other sounds of the same vowel and the spec- 
tra of sounds of a second vowel, then there will be a frequency limit 
above which the spectrum of the reference sound diverges from any 
spectrum of the sounds of the second vowel, but not from any spec- 
trum of the sounds of the same vowel. 


More precisely, any single sound of a vowel compared with sounds of 
another vowel (given similar fundamental frequencies of the sounds) is 
assumed to be describable in terms of a relation of maximal spectral 
similarity and subsequent—related—spectral difference: for a (lower) 
frequency range, the harmonic spectrum of the single sound of the first 
vowel of comparison can resemble some other harmonic spectra of 
the second vowel, but if the maximum of this frequency range of possi- 
ble resemblance is reached, its spectrum differs from all the spectra of 
the second vowel sharing the maximal similarity, while still resembling 
some other spectra of the first vowel. 
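

One possible reading of this relation of maximal similarity and
subsequent difference can be sketched as a comparison of harmonic
levels at one shared fundamental frequency. The sketch is a
simplification made for illustration, not the procedure of this
treatise: the definition of "resemblance" as agreement of harmonic
levels within a fixed tolerance, the 3 dB tolerance, the function names
and the harmonic levels are all assumptions, and the levels are
invented placeholders rather than measurements.

```python
def resemblance_extent(reference, other, tolerance_db=3.0):
    """Number of leading harmonics whose levels agree within tolerance_db."""
    extent = 0
    for ref_level, other_level in zip(reference, other):
        if abs(ref_level - other_level) > tolerance_db:
            break
        extent += 1
    return extent


def divergence_limit(reference, other_vowel_spectra, tolerance_db=3.0):
    """Largest extent of resemblance with any spectrum of the second vowel."""
    return max(resemblance_extent(reference, s, tolerance_db)
               for s in other_vowel_spectra)


def principle_holds(reference, same_vowel_spectra, other_vowel_spectra,
                    tolerance_db=3.0):
    """True if some same-vowel spectrum resembles the reference beyond the
    point at which every spectrum of the second vowel has diverged."""
    limit = divergence_limit(reference, other_vowel_spectra, tolerance_db)
    best_same = max(resemblance_extent(reference, s, tolerance_db)
                    for s in same_vowel_spectra)
    return best_same > limit


# Invented levels in dB for harmonics 1..6 at one shared f0. NOT measurements.
reference = [60.0, 58.0, 50.0, 40.0, 35.0, 30.0]
same_vowel = [[61.0, 57.0, 51.0, 41.0, 33.0, 31.0]]
second_vowel = [[60.0, 59.0, 49.0, 25.0, 20.0, 15.0],
                [59.0, 50.0, 45.0, 30.0, 25.0, 20.0]]

print("harmonics shared with the second vowel:",
      divergence_limit(reference, second_vowel))
print("principle holds:", principle_holds(reference, same_vowel, second_vowel))
```

Multiplying the returned harmonic count by the shared fundamental
frequency gives the frequency limit referred to above. A tolerance
grounded in perception would have to replace the fixed value assumed
here.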


This principle is taken here as the most conservative but also the most 
promising approach and basis for future research on the acoustics of 
the vowel: it is testable and falsifiable in a fully objective manner for 
all levels of fundamental frequency of comparison, it does not need 
further differentiations related to speaker groups or vocal effort or
speaking or singing styles or modes and, therefore, its testing does 
not require any integration of further phonetic knowledge. Moreover, 
it also applies to synthesised sounds which are produced using a har- 
monic synthesiser. Thus, it “hints” or “points” at the basic character- 
istic of the physical representation of vowel quality in a much stronger 
manner than prevailing theory, i.e. it is a stronger index of this rep- 
resentation. 
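

What a harmonic synthesiser does can be sketched in a few lines: it
sums sinusoids at integer multiples of the fundamental frequency, each
with its own level. The following is a minimal sketch under stated
assumptions, not the synthesiser used for this treatise: the function
name, the zero starting phases, the sample rate and the example levels
are all hypothetical.

```python
import math


def synthesise_harmonic_sound(f0_hz, levels_db, duration_s=0.5,
                              sample_rate=44100):
    """Sum sinusoids at k * f0 with the given relative levels in dB.

    Harmonics at or above half the sample rate are skipped, and the
    result is normalised to the range -1..1.
    """
    partials = [(k, 10.0 ** (level / 20.0))
                for k, level in enumerate(levels_db, start=1)
                if k * f0_hz < sample_rate / 2.0]
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        samples.append(sum(a * math.sin(2.0 * math.pi * k * f0_hz * t)
                           for k, a in partials))
    peak = max((abs(v) for v in samples), default=0.0) or 1.0
    return [v / peak for v in samples]


# Hypothetical example: four harmonics of a 220 Hz fundamental.
samples = synthesise_harmonic_sound(220.0, [0.0, -6.0, -12.0, -20.0],
                                    duration_s=0.1)
print(len(samples), "samples")
```

Because every harmonic level is set explicitly, a sound resynthesised
in this way can be compared harmonic by harmonic with the natural sound
it was derived from.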


Moreover, if developed in more detail, it leads to an entire system com- 
prising various possible relations of spectral similarities and related 
spectral differences of sounds of all vowels for a given language. 


Although formulated on the basis of a very extended knowledge of vow- 
el spectra, obviously, these short reflections are but general assump- 
tions open to further clarification and empirical verification or falsifica- 
tion, and even if they can be empirically demonstrated as valid, they 
would still remain a fragmentary and temporary basis in the course of 
reformulating the acoustics of the vowel. Therefore, the drawbacks 
of investigating the harmonic spectrum—above all, the impossibil- 
ity of comparing spectra related to very different fundamental fre- 
quencies directly, and the impossibility of including the analysis of 
vowel sounds not exhibiting quasi-static periodic characteristics of 
the sound wave—are not further discussed. The same applies to the 
limitation of the principle formulated, namely, that it only relates to vow- 
el-specific spectral differences but not to a full determination of vow- 
el-related acoustic characteristics. 


However, for further advances in the investigation of the acoustics of 
the vowel, an assessment of the reliability of every given statement is 
needed, and the possibility of a falsification plays a crucial role in this 
assessment: it is the falsification of a generalised assumption of vow- 
el-related formant patterns that called for this treatise. 


Up to now, concerning the acoustics of the vowel, there are only two 
statements that apply to all vowel sounds: vowel sounds, perceived as 
isolated single sounds, are intelligible and therefore, the vowel quality 
must be physically represented in the corresponding sound wave and 
its characteristics. According to this view, an investigation of the har- 
monic spectra is one of the most promising approaches, even if it is 
limited to quasi-constant voiced vowel sounds. 


Such a step-by-step procedure will be needed as long as there is no
objective and orthogonal method to describe the acoustic characteris- 
tics that physically represent the perceived vowel quality, including all 
types of vowel sounds (see also below). However, during this proce- 
dure, a kind of rule-based knowledge will emerge and provide a basis 
for the development of an objective and comprehensive method. 


13.4 “Forefield” 


All of the above leads to the conclusion that, at present, no theory 
of the acoustic representation of the vowel exists. However, empirical 
evidence exists that indicates the possibility of such a theory and that 
will contribute to its development. Thus, such a theory is currently
in its preliminary stages.


13.5 Two Approaches 


Prevailing theory is characterised by its explanation and description of 
vowel sounds within a physical model unspecific to speech: all kinds 
of sounds and noises are transformed by filters in the same way,
irrespective of whether or not they concern utterances (speech events).


One possible way to respond to the difficulties of understanding pre- 
vailing theory in terms of its intellectual re-enactment and to the fact 
that empirical findings can contradict its predictions might be to sup- 
plement the existing source-filter model or to replace that model by 
another physical model external to language and speech. 


Another approach might be to assume that the production and for- 
mation of vocal sounds is speech specific and, based on such a pre- 
mise, to develop a method for describing vowel sounds in form-related 
terms. This second approach assumes that the vowel sound and its 
manifestations elude description within a purely physical model. 


Whether this covers all of the possible approaches is left open for dis- 
cussion here. 


As explained in Section 13.1, there are substantial reasons for scepti- 
cism about the possibility of adjusting prevailing theory and the related 
methods of acoustic analysis. One further and important aspect, in 
addition to the arguments already mentioned, concerns the following 
consideration: it would be possible for humans to produce a vow- 
el-unspecific source sound and transform that sound using vocal-tract 
resonances, thereby producing the respective vowel-specific physical 
characteristics according to which listeners perceive vowels, both
unambiguously and independently of fundamental frequency. But it
belongs to the actual acoustic phenomenon of the vowel sound to
systematically deviate from this. The empirical evidence for vowel sounds
suggests that humans do not produce such sounds as systematically 
as physics and physiology seem to predetermine. This contrast might 
prove fundamental for future theory building. 


Elsewhere, the author has formulated the state of affairs as follows: (i) 
either resonances as such, and thus the corresponding pharyngeal, 
oral and nasal resonance patterns of the vocal tract, fail to represent in 
full the physical quantity to which language and speech directly refer, 
but another physical quantity can be found instead; if this is the case, 
then it is simply a matter of replacing the existing (physical) model with
another rather than adopting a fundamentally different perspective; (ii)
or, aside from the human voice, no construction, no instrument and no
process can be found to exist in physics that would explain and allow 
for the production of vowel sounds including basic variations of sound 
characteristics, for example, fundamental frequency and phonation 
type; then, the physical representation of human voice cannot be re- 
lated to a simple voice-independent physical quantity, but instead, the 
voice would produce a “substance” or “quantity”. 


Based on all the reflections, experiences and observations presented, 
this treatise belongs to the second kind of undertaking. This calls for a 
corresponding phenomenology and for theory building. 


13.6 Phenomenology 


On the one hand, the existing documentation of vowel sounds hitherto 
published is no more than fragmentary and on the other, the methods 
for describing their acoustic characteristics have substantial short- 
comings and limitations. Thus, as argued above, a phenomenology is 
needed, that is, a step-by-step build-up of systematic compilations of
vowel sounds related to individual languages, including the variation
of all relevant production parameters. In its course, attempts at
describing acoustic characteristics related to vowel qualities in
terms of knowledge-based rules will become possible (see above).


In the first instance, such a phenomenology refers to the vowel sounds 
of a particular language, produced in isolation or detached from sound 
context, exhibiting quasi-constant spectral characteristics and allow- 
ing for high scores of vowel identification in listening tests, involving 
listeners of the speech community of that particular language. 


13.7 Theory Building


As said, vowel sounds perceived as isolated single sounds can be in- 
telligible. This fact is central to human voice and speech. With regard 
to such sounds, the psychophysical question arises as to which
describable physical characteristic or which ensemble of physical
characteristics may be said to represent the perceived vowel qualities.


Theory building thus faces a threefold challenge. Firstly, it must pro- 
duce a uniform, systematic and orthogonal method to describe vow- 
el-specific acoustic characteristics. Only such a descriptive method 
enables a systematic synthetic reproduction of vowel sounds, based 
on empirically determined characteristics of natural vocalisations, and 
thus the verification of the significance of corresponding analyses. 
Secondly, in relation to the phenomenology discussed, theory building 
must deduce hypotheses that predict the physical representation of 
vowel quality irrespective of the individual cases of the vowel sounds, 
thus extrapolating the phenomenological description. These hypoth- 
eses must satisfy the requirements of verification and falsification on 
the one hand, and be transferable to different languages on the other. 
Thirdly, theory building must seek to explain empirical findings and the 
hypotheses deduced from such findings. 


Afterword


This treatise is bound to raise many questions, which are not discussed 
in detail here. Moreover, judging from previous experience of
academic discussions, some of the considerations and arguments
presented here are likely to be rejected on principle.


Some major issues discussed in the literature have not been consid- 
ered in depth in this text, so that its main argument could be presented 
in straightforward, general and clear terms. Moreover, in-depth consid- 
eration of the issues mentioned has also been dispensed with because 
they appear in a different light from the perspective adopted here and, 
thus, they need to be discussed in another context than is usually the 
case. Within the following exemplary comments, however, some indi- 
cations are given. 


Against the background of the present reflections, experiences and 
observations, we conclude that explaining the lacking distinctiveness 
of expected spectral energy maxima in terms of the characteristics of 
auditory perception as formant merging without taking into account 
the entire systematics of empirically observable, vowel-specific spec- 
tral characteristics—in particular their dependence on fundamental 
frequency and the possible ambiguity of spectral envelope peaks and 
of formant patterns—is questionable. 


The same holds true for normalisation attempts with regard to the pre- 
sumed general differences between the vowel-specific formant pat- 
terns among children, women and men: such normalisation attempts 
would have to be approached quite differently if the comparisons of 
the formant patterns of the three speaker groups did not only include 
different but also similar fundamental frequencies of the sounds of all 
groups. 


The same also applies when attempting to generally relate formant 
shifts, which occur when the fundamental frequency for sounds of one 
vowel is raised, to paralinguistic characteristics, in particular vocal 
effort: low- and high-pitched sounds can both be formed loudly and 
softly, and the calculated formant patterns of vowel sounds do not only 
depend on the vowel quality but also on the fundamental frequency 
in principle. Hence, one has to expect the occurrence of ambiguous 
formant patterns for sounds produced with equal vocal effort. Thus, 
we conclude that the shifts of the lower formants with rising
fundamental frequency and formant pattern ambiguity as such are not
necessarily related to paralinguistic aspects.


To mention one last example, the same also applies when attempting
to relate, in general terms, formant shifts that occur when raising
fundamental frequency in singing to formant tuning: evidence given in 
the studies published on this matter does not allow for a conclusion 
of whether the documented observations refer to the idiosyncrasies of 
individual singers, to the stylistic characteristics of a particular singing 
technique or style with changes in vowel quality (possibly caused by 
so-called vowel modification), to vocal effort, or whether the observa- 
tions indeed refer to the fundamental characteristics of vowel sounds. 


However, the tendency in the literature to consider vowel sounds at 
lower fundamental frequencies (with limited frequency variation) to
be characteristic of speech, and vowel sounds at middle and higher
fundamental frequencies (with extensive frequency variation) to be
characteristic of singing, and the tendency to conduct investigations 
based on these assumptions, needs to be refuted in its turn: neither in 
everyday life, nor in the entertainment sector, nor indeed in musical art 
and vocal interpretation is there any such thing as “normal speech” for 
which, in contrast to singing, a single “average” fundamental frequen- 
cy could be statistically determined. 


The corresponding indications in existing formant statistics are not 
representative of experienceable speech and observable acoustic 
characteristics of vowel sounds: they are only representative of sounds 
uttered into a microphone in a small room in a relaxed and quasi-mo- 
notonous manner. (Such a restricted formulation still lacks contextual 
relativisation in terms of a particular language and “culture”.) There is 
no essential difference between the fundamental frequency ranges for 
speaking on the one hand, and for singing on the other, no matter how 
these categories are determined and distinguished in a scientifically 
reasonable way. If one attentively listens to everyday utterances and
to utterances in theatre and film (nowadays easily accessible due to
television), the corresponding experiences make this plain, and both
fields of experience need to be integrated into a phenomenology of
vowel acoustics (see the Materials section for corresponding
examples).


With this consideration in mind, as mentioned, some of the major as- 
pects discussed and interpreted in the literature appear in a different 
context than is often reflected upon. 


As indicated at the beginning of this afterword, the present critical 
take on the prevailing theory of vowel acoustics must, in turn, prompt 
scepticism, as has already become evident in many scholarly debates, 
together with the respective counterarguments. This text has attempted
to take into account these arguments. Additional comments follow
below. 


Whatever the extraordinary and often surprising role of perception in 
the recognition of speech sounds, this role neither relativises the fact 
that isolated vowel sounds with a quasi-static sound characteristic can 
be intelligible beyond a concrete syntactic and semantic context nor 
that their harmonic spectra are vowel specific. Thus, a psychophysical 
approach to such vowel sounds, that is, a theory of the relationship be- 
tween perceived vowel quality and physical characteristic or ensemble 
of characteristics, must not only be deemed possible but also necess- 
ary. The psychophysics of vowel sounds constitutes the basis for an 
investigation of human voice and speech. 


We adopt the viewpoint—and, therefore, have written this text—that 
any attempt to formulate such a theory in terms of formant patterns 
cannot be successful. Consequently, a different approach needs to be 
formulated. 


How can one explain the fact that 
most previous studies of vowel sounds, and thus of voice and speech, 
have not integrated such a line of argument? There seem to be several 
reasons for this shortcoming. In particular, however (and this reason 
is considered paramount here), there has been a lack of robust, ex- 
tensive, systematic and representative empirical documentation of the 
aspects discussed in this treatise. As a consequence of the ab- 
sence of such reference documentation, the discussion lacks a bind- 
ing empirical basis that any interpretation must account for. At the same 
time, the basis for formulating an alternative theory is lacking, too. 


Thus, whereas existing individual values obtained in studies of vowel 
sounds apply to the specific conditions under which these data were 
gathered, the values are often interpreted in terms of a general physical 
representation of the vowel, which is empirically contradicted. Gener- 
alisation is the critical issue at stake. 


To repeat: whereas average formant patterns (as determined statis- 
tically and separately for each of the three age- and gender-related 
speaker groups and related to average fundamental frequencies of re- 
laxed and quasi-monotonous speaking into a microphone in a small, 
enclosed space) are in general vowel specific, the same does not hold 
true for substantial fundamental frequency variations evident as pro- 
sodic characteristics already in everyday language. Whereas for vow- 
el sounds produced by men and involving a fundamental frequency 
variation of one octave but not exceeding 200 Hz by much, formant 
patterns (if methodically substantiated) of vowel sounds in most cases 
appear independently of fundamental frequency, the same does not 
hold true for the vowel sounds produced by the majority of women and 
by almost all children also involving a fundamental frequency variation 
of one octave. Whereas, in sound synthesis, a specific set of filters re- 
lated to a specific fundamental frequency makes it possible to perceive 
a certain vowel quality, in many cases it does not hold true that the 
same vowel quality is perceived if the filter pattern remains constant 
but the fundamental frequency is significantly altered. And so on. 


Because there is a lack of reliable, extensive, systematic and repre- 
sentative empirical references, including the documentation of vari- 
ation of all basic production parameters needed in order to evaluate 
which physical characteristic is related to a single production param- 
eter and which is in general related to vowel quality, and because, in 
many handbooks of phonetics, the acoustic characteristics of vow- 
els are often treated briefly, in generalised and summary accounts yet 
without relativisation and problematisation, the reflections, experienc- 
es and observations reported in this treatise are partly unfamiliar, are 
rarely reconsidered and are in general not integrated when interpreting 
individual findings of other studies. In the first instance, this complicates 
the discussion within phonetics and psychophysics. Beyond this, how- 
ever, attention has to be given to the significance of this lack of rela- 
tivisation and problematisation for other areas of science—not only 
for fields such as speech recognition, speech pathology, audiology, 
or neuropsychology, but also for the investigation of voice as such, 
including philosophy and art, and for voice and speech education and 
training. How are these fields meant to relate to reliable basic knowl- 
edge and understanding of voice and speech production, and how 
are these fields meant to design reliable experiments if the unresolved 
problem of generalising individual measurements is not placed at the 
centre of understanding and investigation? 


Moreover, some scholars are fundamentally critical of basing the psy- 
chophysics of the vowel on isolated vowel sounds and they question 
the recognisability and the linguistic function of such sounds. This crit- 
ical position generally relates to a linguistic definition of the vowel as a 
vocoid and as syllabic. However, the previous reflections have shown 
that this treatise does not concur with the resulting notion of a fun- 
damental opposition between isolated versus context-bound sounds, 
static versus dynamic spectral processes and “functionless” utter- 
ances versus those with a linguistic function. Much could be said in 
response both to such an opposition and a critical take on the psycho- 
physics of isolated vowel sounds. Within the limited scope of the pres- 
ent study, however, only a few aspects can be mentioned (the problem 
as such is a matter for future debate and research): 


As said in the introduction, we take the stand that the recog- 
nisability of vowels (monophthongs) as single speech sounds 
perceived in isolation by listeners of a given speech community 
belongs to the elementarisation as a basic characteristic of vo- 
cal expression and speech and thus to the aptitude of the latter 
for a phonetic system of writing. Thus, structurally, isolated vow- 
el sounds must be intelligible as such. 

Refuting the fundamental recognisability and the function of iso- 
lated sounds—their function in its broad sense, emotional and 
aesthetic qualities included—is borne out neither by any experi- 
ence of art, vocal interpretation and entertainment, nor by every- 
day experience. (This order of denomination, artistic utterances 
first, everyday utterances last, is chosen to indicate that all phe- 
nomena discussed here may first be experienced in a direct way 
in the arts; then, when familiar with the correspondingly various 
types of possible utterances and expressions, they will contin- 
uously also get one’s attention in everyday utterances; see also 
the corresponding consideration in the introduction.) 

In this respect, it is worth pointing out the central role that is 
played by sustaining vowel sounds, sometimes for as long as 
possible, in musical composition and vocal interpretation—ei- 
ther in isolation or in a sound context and with or without funda- 
mental frequency variation as a melody. The same holds true for 
basic and advanced voice training in the field of interpretation 
and performance. 

In this respect, it is also worth noting the occurrence of vowel 
sounds produced in isolation in vocal expressions such as ex- 
clamations or affirmations. (The German exclamations “Ahhh”, 
“Ohhh”, “Uhhh” and “Ihhh”, to give a paradigmatic example, 
have a different meaning depending on the context of expres- 
sion and the vowels must be understood as such.) 

Dynamic processes are often represented and considered as 
formant transitions. Yet the lack of a general correspondence 
between vowel qualities and related formant patterns and the 
limited methodological reliability of formant determination—dis- 
cussed in this study in relation to quasi-static sounds—must 
also be linked to dynamic descriptions. Thus, for instance, it is 
not evident how formant transitions for a sound produced at a 
fundamental frequency of approximately 200 Hz are supposed 
to be compared with those of another sound of the same vowel 
but at a fundamental frequency of 500 Hz. 


Furthermore, whereas other scholars have recognised some or all of 
the problems discussed here, they often reproach the present kind of 
fundamental deliberation for not formulating a new theory. Such argu- 
mentation, however, does not correspond to the views and the stance 
of this treatise. If reasoned, well founded and applicable, any criticism 
of prevailing acoustic theory has its own intrinsic value, utterly irre- 
spective of whatever it is that is offered or proposed beyond that crit- 
icism. Above all, it allows for an identification and formulation of chal- 
lenges and, spurred by the need to resolve them, it drives the search 
for a new approach. 


Pursuing a phenomenology and building a new theory requires a con- 
siderable effort along with the appropriate resources. Doing so presup- 
poses that the scholarly community acknowledges the importance of 
such a venture. Any such acknowledgment, however, requires a com- 
prehensible critique of prevailing theory to be advanced, together with 
a reinterpretation of previous empirical findings. 


The author has also written this text because he does not know how 
far-reaching his contribution and that of his research colleagues is to 
the phenomenology and a new theoretical framework. However, two 
attempts are in progress. With regard to phenomenology, a research 
team is currently creating a large corpus of vowel sounds for Standard 
German, produced by children, women and men, including extensive 
variation of basic production parameters and including both untrained 
and trained speakers and singers (see Maurer, n.d.). In this way, we 
attempt to contribute to the creation of a systematic reference basis 
for vowels of single languages. With regard to theory, in a subsequent 
treatise, we will investigate in detail the thesis of vowel-specific har- 
monic spectra. 


To conclude, the general significance of acoustic characteristics of 
vowel sounds should, as indicated, not be regarded as solely the sub- 
ject of phonetics. Above all, it concerns the understanding of the voice 
as such. 


The voice is currently attracting particular attention in the humanities. 
Deliberations in these fields are directly related to the knowledge and 
experience gained in artistic creation and interpretation, and there is 
a strong emphasis on the need for an interdisciplinary approach. In 
line with such a claim, the research culture in the aesthetics of the 
voice ought to adopt a particular stance toward the acoustics of the 
voice, too: namely, not only to cite phonetics with regard to ex- 
isting descriptions of vocal utterances, but to critically discuss these 
descriptions and link them to considerations and experiences of art, 
interpretation and entertainment. In this context, a call should emerge 
not to take the “Western” perspectives and production styles as the 
starting point of investigation for the acoustics of the voice, but ini- 
tially to consider any vocal expression, habit and style of any cultural 
context as equivalent. In doing so and in facing the diversity of possi- 
ble vocal expressions, at least in the first instance, no classification of 
“normal” and “differing” phenomena and no hierarchical order should 
be imposed, but a decided descriptive perspective should be adopted. 
As said, there should be no underestimation or misunderstanding of 
the fact that raising questions regarding voiced speech sounds raises 
questions regarding the voice itself. 


Our vocal cords produce sound. The resonances of the pharyngeal, 
oral and nasal cavities could form its characteristics into a formant 
pattern that always and uniquely represents a vowel physically, and 
thus allows the listener to perceive it accordingly. Empirical investiga- 
tion reveals, however, that the spectral characteristics of vowel sounds 
systematically deviate from such an option. This observation leads to 
the conclusion that, at present, we are but in the preliminary stages of 
understanding the physical representation of the vowel and, thus, its 
materialised form. 


Materials 


The Materials section contains selected excerpts from the literature 
and presents exemplary series of vowel sounds and related acoustic 
analyses. An extended version of the Materials is also presented in 
digital form online; please refer to: 

http://www.phones-and-phonemes.org/vowels/acoustics/preliminaries 


Materials Part I 


The first part of the Materials section contains selected excerpts 
from the literature that are related to the first part of the main text. 


M1 Prevailing Theory 


Vowels 


“Vowel [...]. 1. (also vocoid) In phonetics, a segment whose articulation 
involves no significant obstruction of the airstream, such as [a], [i] or 
[u]. Strictly speaking, a glide such as [j] or [w] may also be regarded as 
a (brief) vowel in this sense. 2. In phonology, a segment which forms 
the nucleus of a syllable. 3. Any letter of the alphabet which, general- 
ly or in a particular case, represents a vowel in sense 2.” (Trask, 1996, 
p. 382) 


“Vocoid [...]. 1. A synonym for vowel in the phonetic sense of that term 
(sense 1), introduced in an effort to remove the ambiguity between the 
phonetic and the phonological sense of ‘vowel’. While possibly useful, 
the term has never become established. Pike (1943). 2. More narrowly, 
a vocoid in sense 1 which is also syllabic: a true vowel, as opposed to 
a glide or approximant. Sense 2: Laver (1994).” (Trask, 1996, p. 378) 


“Vowels and Consonants. Phonetics has traditionally classified the 
segments of speech into two basic varieties which are called vowels 
and consonants. Once again, there has never been a straightforward 
definition of these terms. Early linguists in India also grappled with the 
concepts of vowel, consonant, and syllable around 800 BC, and they 
recognized that the three notions are hopelessly intertwined [...]. The 
definitions used here will be similar to those of the ancient Sanskrit 
scholars, and in fact, the development of modern phonetics in the 
West owes much to the transmission of knowledge in translation from 
the Sanskrit sources. 

A vowel is defined as a ‘vowel-like segment’ (what Pike [...] 
termed a vocoid) that occupies the nucleus of a syllable. A segment is 
considered to be a vocoid when its articulation permits the relatively 
free passage of air through the center of the mouth. This definition is 
also rather loose, but in roughly familiar terms, most segments that are 
at least as open as an English w or y-sound (the latter is transcribed [j] 
in IPA) are vocoids, all others being non-vocoids. A consonant is then 
defined simply as a non-vocoid, no matter what syllable position it oc- 
cupies. This imperfect dichotomy leaves room for a middle category, 
that of the semivowel, which is defined as a vocoid located outside the 
nucleus of a syllable. Semivowels, in spite of being vocoids, are usually 
regarded as a special sort of consonant (often called a ‘glide’) in the 
interests of preserving the consonant-vowel dichotomy. The interplay 
of consonants, vowels, and syllables in the speech stream is given a 
slightly different (more acoustic) view by Orlikoff and Kahane: ‘Conso- 
nants differ from vowels primarily by the amount of vocal tract con- 
striction employed in their production [...] Speech can be considered 
to be an overlay of consonants on the vocal signal. The dispersion of 
consonants results in an amplitude modulation of the acoustic energy 
that, for the most part, gives rise to our perception of syllables.’” (Fu- 
lop, 2011, pp. 8-9) 


Speech production: source and filter 


“The speech wave is the response of the vocal tract filter systems to 
one or more sound sources. This simple rule, expressed in the termi- 
nology of acoustic and electrical engineering, implies that the speech 
wave may be uniquely specified in terms of source and filter charac- 
teristics. In spite of the technical phrasing it is apparent that this state- 
ment also covers essentials of the phonetician’s concept of speech 
production.” (Fant, 1960, p. 15) 


See also Chapter M4. 


Formants 


“The spectral peaks of the sound spectrum |P(f)| are called formants. 
Referring to Fig. 1.1-2, it may be seen that one such resonance has its 
counterpart in a frequency region of relatively effective transmission 
through the vocal tract. This selective property of |T(f)| is independent 
of the source. The frequency location of a maximum in |T(f)|, i.e. the 
resonance frequency, is very close to the corresponding maximum in 
spectrum P(f) of the complete sound. Conceptually these should be 
held apart, but in most instances resonance frequency and formant 
frequency may be used synonymously. Thus, for technical applications 
dealing with voiced sounds it is profitable to define formant frequency 
as a property of T(f). 

The basic principle of the theory of voiced sounds is that, to a 
first order of approximation, the filter function is independent of the 
source. The formant peak will thus only accidentally coincide with the 
frequency of a harmonic. The formant frequencies can change only as 
a result of an articulatory change affecting the dimensions of the var- 
ious parts of the vocal tract cavity system and thus the filter function. 
Conversely, but with the limitations implied by the concept of com- 
pensatory forms of articulation, the formant frequencies provide infor- 
mation about the position of the speaker’s articulatory organs. If these 
formant frequencies are held constant and the fundamental frequency 
is raised one octave, the result is ideally that twice as many pulses 
per second are emitted from the voice organs. The distance between 
adjacent harmonics in the spectrum will be doubled, and the number 
of harmonics up to a certain fixed frequency limit will thus be halved. 
If a specific formant, for instance the first, comes close to the 6th har- 
monic at the lower pitch, it will be the 3rd harmonic that comes closest 
to the same formant in the case of the higher pitch. The concepts 
of formant frequency and harmonic number should not be confused.” 
(Fant, 1960, p. 20) 
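
The relation between formant frequency and harmonic number described in the excerpt above can be illustrated with a short calculation. The following Python sketch is an illustration only and rests on stated assumptions: the function names nearest_harmonic and harmonics_below, the first formant of 600 Hz, the fundamental frequencies of 100 Hz and 200 Hz and the frequency limit of 3000 Hz are hypothetical values chosen for this example and are not taken from Fant (1960).

```python
def nearest_harmonic(formant_hz: float, f0_hz: float) -> int:
    """Return the number of the harmonic of f0 that lies closest to the formant."""
    # Harmonic numbers start at 1 (the fundamental itself).
    # Exact ties between two harmonics are not treated specially here.
    return max(1, round(formant_hz / f0_hz))


def harmonics_below(limit_hz: float, f0_hz: float) -> int:
    """Count the harmonics of f0 at or below a fixed frequency limit."""
    return int(limit_hz // f0_hz)


# Hypothetical values: first formant at 600 Hz, f0 raised from 100 Hz to 200 Hz.
FORMANT_HZ = 600.0
LIMIT_HZ = 3000.0

for f0 in (100.0, 200.0):
    print(
        f"f0 = {f0:.0f} Hz: closest harmonic = {nearest_harmonic(FORMANT_HZ, f0)}, "
        f"harmonics up to {LIMIT_HZ:.0f} Hz = {harmonics_below(LIMIT_HZ, f0)}"
    )
```

With these assumed values, raising the fundamental frequency by one octave moves the harmonic closest to the formant from the 6th to the 3rd and halves the number of harmonics below the limit, in line with the excerpt.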


See also Chapters M4 and M6. 


Vowel-specific formants 


“Usually vowels can be quite well characterized in terms of the frequen- 
cies of just the first and second formants, but the third formant should 
also be measured for high front vowels and for r-colored vowels.” 
(Ladefoged, 2003, p. 105) 


Age- and gender-specific formants 


“The length of the pharyngeal-oral tract depends on the physical size 
of the speaker. The length affects the frequency locations of all of the 
vowel formants; this fact helps us to predict where the formant peaks 
in the spectrum will appear for men, women, and children. A very sim- 
ple rule relates the frequencies of the formants to the overall length of 
the tract from glottis through lips. The rule for this relation is: 

Length Rule. The average frequencies of the vowel formants 
are inversely proportional to the length of the pharyngeal-oral tract. 
In other words, the longer the tract, the lower are its average formant 
frequencies. 

The neutral vowel formants for the average man, with an oral 
tract 17.5 cm in length, are at 500, 1500, 2500 Hz, and so on, with the 
lowest formant at 500 Hz and frequency spacing of 1000 Hz between 
all formants. 

An easy way to remember the neutral formant frequencies is to 
think of the odd numbers 1, 3, 5, 7, 9, and so on, because the formant 
frequencies of a uniform tube that is closed at one end and open at 
the other, like the pharyngeal-oral tract, are always odd multiples of 
the frequency of the lowest formant. For example, begin with the basic 
formant frequency, 500 Hz, as the unit or 1; then the formant frequen- 
cies above that are 500 x 3 = 1500 Hz, 500 x 5 = 2500 Hz, and so on. 
This method, calculating the formants above F1 as multiples of F1, 
applies only as a model of a neutral tract shape. 

The pharyngeal-oral tract length of an infant is approximately half the 
length of that of a man. Therefore, following our Length Rule about 
formant frequency locations, the formants of a neutral-shaped infant 
tract in relation to a man’s would be at frequency locations that are a 
factor of the reciprocal of 1/2, or twice those of the man. On this basis 
the infant formant locations for a neutral vowel would be as follows: F1 
is 500 x 2 = 1000 Hz, F2 is 1500 x 2 = 3000 Hz, F3 is 2500 x 2 = 5000 Hz, 
and so on. 

Following the same procedure, a woman’s vocal tract, on the av- 
erage, is about 15% shorter than that of a man. The ratio correspond- 
ing to this amount of shortening is approximately 5/6. The reciprocal 
of 5/6 is 6/5, which is equal to a factor of 1.20, which, when multiplied 
by the man’s neutral formant frequencies, gives the woman’s values of 
20% higher: F1 is 500 x 1.2 = 600 Hz, F2 is 1500 x 1.2 = 1800 Hz, F3 is 
2500 x 1.2 = 3000 Hz, and so on. [...] 

The Length Rule tells us approximately where we may find the 
formants for the very young as well as for older, larger persons. How- 
ever, the neutral locations of F1 and F2 for an individual are also affect- 
ed by the length proportions of the vocal tract between the oral and 
pharyngeal cavities (Fant, 1973, Chapter 4). In general, the location and 
spacing of formants F3 and above are more closely correlated with 
length of vocal tract than for F1 and F2. The average locations of F1 
and F2 for an individual are also affected somewhat by language envi- 
ronment and training.” (Pickett, 1999, pp. 38-40) 
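
The arithmetic of the Length Rule in the excerpt above can be reproduced with a few lines of Python. The sketch below is an illustration only. It assumes the standard quarter-wavelength model of a uniform tube closed at one end, F_n = (2n - 1) c / (4 L), and a speed of sound of 35,000 cm/s; neither the formula nor this value is stated in the excerpt, and the function names neutral_formants and scale_by_length are hypothetical.

```python
# Assumed speed of sound in cm/s (not given in the excerpt).
SPEED_OF_SOUND_CM_S = 35_000.0


def neutral_formants(tract_length_cm, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    """
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * tract_length_cm)
        for n in range(1, count + 1)
    ]


def scale_by_length(formants_hz, length_ratio):
    """Apply the Length Rule: frequencies are inversely proportional to length."""
    return [f / length_ratio for f in formants_hz]


man = neutral_formants(17.5)          # tract length of 17.5 cm as in the excerpt
infant = scale_by_length(man, 1 / 2)  # half the length, hence twice the frequencies
woman = scale_by_length(man, 5 / 6)   # ratio 5/6 as in the excerpt, hence factor 1.2

print("man:   ", [round(f) for f in man])
print("infant:", [round(f) for f in infant])
print("woman: ", [round(f) for f in woman])
```

Note that the excerpt approximates a shortening of about 15% by the ratio 5/6; a literal ratio of 0.85 would yield a factor of roughly 1.18 rather than 1.20.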


See also Chapter M5. 


M2 Prevailing Empirical References 


Illustration: including radiation factor/radiation impedance 


For a more differentiated graphic illustration, showing a 12 dB/octave 
slope of the source and a 6 dB/octave intensity increase because of the 
radiation impedance, see Ladefoged (1996, p. 104), Figure 7.7 and the 
related comment: “Figure 7.7 shows a source-filter view of the produc- 
tion of a vowel. The spectrum of the glottal pulse is shown on the left 
of the figure. In this case we have taken the vocal folds to be vibrating 
at 100 Hz, so the components are at 100 Hz intervals. To the right of 
the spectrum is the set of curves specifying the vocal tract response. 
The output of the vocal tract can be regarded as the input to another 
box entitled ‘radiation factor,’ which we must now take into account. 
[...] these vibrations [...] inside the mouth [...] are not themselves the 
variations in air pressure that we hear. The air in the vocal tract vibrates 
so that the air particles at the open end between the lips move back- 
ward and forward. It is these movements that start the air outside the 
lips vibrating. The air between the lips acts like a piston, a source of 
sound producing variations in air pressure that radiate out from the 
lips just as the variations in air pressure radiate out from a source of 
sound such as a tuning fork. The movements of this piston of air are 
more effective in causing variations in pressure in the surrounding air 
at some frequencies than others. The higher the frequency, the greater 
the response of the surrounding air to the action of the air vibrating in 
the vocal tract. This effect, which we have termed the ‘radiation factor’ 
(‘radiation impedance’ is the term used in more technical books), can 
be regarded as a kind of filter that boosts the higher frequencies by 
6 dB per octave. The curve representing the radiation factor is shown 
above the third box in figure 7.7. 

The output produced at the lips depends on the vocal cord 
source, the filtering action of the vocal tract, and the further modifica- 
tions produced by the radiation factor. Normally the vocal cord source 
is the same for each vowel, apart from variations of pitch. The vocal 
folds may be vibrating at 100 Hz, or at 200 Hz, as in the examples we 
have been considering, or at any other frequency in the range of the 
human voice. But irrespective of the fundamental frequency, the spec- 
tral slope of the cord pulse will usually be approximately -12 dB per 
octave. The filtering action of the vocal tract will be different for each 
position of the vocal organs, thus producing formants (peaks in the res- 
onance curve) at different frequencies. The spectrum of the waveform 
beyond the lips (shown on the right of figure 7.7) will have peaks in re- 
gions which depend on the filter characteristics of the vocal tract. The 
general slope of the output spectrum will be influenced by the slope of 
the spectrum of the glottal pulse (-12 dB/octave) and the radiation fac- 
tor (+6 dB/octave). Taken together these two slope factors account for 
a -6 dB/octave slope in the output spectrum. The major characteris- 
tics of the output spectrum – the formant peaks – are superimposed on 
this general slope. They are primarily dependent on the filtering charac- 
teristics of the vocal tract.” (Ladefoged, 1996, pp. 104-105) 
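
The combination of the two slope factors named in the excerpt above can be checked numerically. The following Python sketch is an illustration only. It assumes idealised, perfectly straight slopes on a logarithmic frequency axis and a reference frequency of 100 Hz; the function names slope_db and output_tilt_db are hypothetical and do not stem from Ladefoged (1996).

```python
import math

SOURCE_SLOPE_DB_PER_OCTAVE = -12.0    # glottal pulse spectrum, as in the excerpt
RADIATION_SLOPE_DB_PER_OCTAVE = 6.0   # radiation factor, as in the excerpt


def slope_db(freq_hz, ref_hz, db_per_octave):
    """Level change in dB between ref_hz and freq_hz for an idealised straight slope."""
    octaves = math.log2(freq_hz / ref_hz)
    return db_per_octave * octaves


def output_tilt_db(freq_hz, ref_hz=100.0):
    """General tilt of the output spectrum, ignoring the vocal tract filter."""
    return (
        slope_db(freq_hz, ref_hz, SOURCE_SLOPE_DB_PER_OCTAVE)
        + slope_db(freq_hz, ref_hz, RADIATION_SLOPE_DB_PER_OCTAVE)
    )


# One, two and three octaves above the assumed 100 Hz reference.
for freq in (200.0, 400.0, 800.0):
    print(f"{freq:.0f} Hz: {output_tilt_db(freq):+.1f} dB relative to 100 Hz")
```

Under these assumptions, the level falls by 6 dB for each doubling of frequency, which corresponds to the overall slope given in the excerpt; the formant peaks of the vocal tract filter are deliberately left out of this sketch.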


Formant statistics by Fant et al. 


With regard to the study of Fant (1959; see Section 2.1, Table 3), see 
also the later study of Fant, Henningsson, and Stålhammar (1969) con- 
cerning statistical formant patterns for long Swedish vowels produced 
by men. 


Formant statistics for Standard German 


Older studies concerning formant patterns of German vowels were 
published by Jørgensen (1969), Iivonen (1970, 1986), Rausch (1972), 
Wängler (1981), and Ramers (1988). For further indications of formant 
statistics for Standard German, see the online digital version of the 
materials. 


Formant statistics for other languages 


For further indications of formant statistics of other languages, see also 
the online digital version of the materials. 


Materials Part II 


The second part of the Materials section contains selected excerpts 
from the literature as well as further indications and discussions relating 
to the second part of the main text. 


M3 Vowels and Number of Formants 


Formant merging 


“If you know you are analyzing a low back vowel, don’t be surprised to 
find one thick bar on the spectrogram that really corresponds to two 
formants close together below 1,000 Hz.” (Ladefoged, 2003, p. 114) 


Referring to vocalisations of /ɔ/ as in caught: “When the formants are 
close together [...] neither the wide- nor the narrowband spectrum 
gives a good indication of the formant frequencies. [...] The first two 
formants appear as a single peak below 1,000 Hz. Their frequencies 
cannot be determined from these spectra.” (Ladefoged, 2003, pp. 
119-120) 


Spurious formant 


“Sometimes it is not immediately obvious whether a particularly wide 
band represents one formant or two. Figure 5.8 is a spectrogram of the 
word bud, spoken by a female speaker of Californian English. There is 
a wide band below 1,000 Hz, but is this one formant or two formants 
close together as in Figure 5.7? Noting that there is a clear formant at 
about 1,500 Hz in Figure 5.8, and additional formants higher, we must 
take it that there is only a single formant below 1,000 Hz. It seems that 
there is some kind of extra formant near the first formant, making this 
dark bar wider. From the evidence of this one vowel it is impossible to 
say whether the additional energy is above or below the first formant. 
Further analysis of this speaker’s voice showed that there was often 
energy around the 1,000 Hz region, irrespective of the vowel. This spu- 
rious formant is not connected with the vowel quality, but is simply a 
characteristic of the particular speaker’s voice. This is a good example 
of the necessity of looking at a representative sample of a speaker’s 
voice before making any measurements of the formants.” (Ladefoged, 
2003, pp. 114-115) 


“Flat” vowel spectra 


“Flat-spectrum stimuli, consisting of many equal-amplitude harmon- 
ics, produce timbre sensations that can depend strongly on the phase 
angles of the individual harmonics. For fundamental frequencies in the 
human pitch range, many realizable timbres have vowel-like perceptu- 
al qualities. This observation suggests the possibility of constructing 
intelligible voiced speech signals that have flat-amplitude spectra.” 
(Schroeder & Strube, 1986) 


M4 Vowels and Fundamental Frequency 


Independence of formants and fundamental frequency 


“Obviously, formant frequency is independent from the fundamental 
frequency [...] Changes in formant frequency are due to changes in the 
shape of the vocal tract cavity or cavities; changes in pitch frequency 
to stretching of the vocal cords. If the two physiological events are 
independent, so are the acoustic results of each event [...].” (Delattre, 
1958/1980) 


“[...] when a complex wave consists of a damped waveform repeat- 
ed at regular intervals, the component frequencies will always have 
the same relative amplitudes as the corresponding components in 
the continuous spectrum representing the isolated occurrence of the 
damped wave. Consequently, altering the rate at which the vocal folds 
produce pulses will affect the fundamental frequency of the complex 
wave; but it will not alter the formants (the peaks in the spectrum), 
which correspond to the basic frequencies of the damped vibrations 
of the air in the vocal tract. It is in this sense that we may say that the 
formants of a sound are properties of the corresponding mouth shape. 
[...] the formants which characterize a given vowel irrespective of the 
rate at which pulses are produced by the vocal cords [...] 

We saw in Chapter 6 that the pitch of a sound depends mainly 
on the fundamental frequency. Accordingly, when there is a variation in 
the rate at which pulses are produced by the vocal cords, there will be 
a change in the pitch of the sound (although there will be no change in 
the formants, and hence no change in the characteristic vowel quality). 
It is usually possible to alter the pitch of a vowel sound without altering 
its characteristic quality, because each of these factors is controlled 
by a separate physiological mechanism. As we have seen, the pitch 
depends on the action of the vocal cords, and the characteristic quality 
depends largely on the formants, which have certain fixed values for 
each particular shape of the vocal tract.” (Ladefoged, 1996, pp. 98-99) 
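
The claim in the excerpt above, that the harmonics take their relative amplitudes from a fixed spectral envelope regardless of the pulse rate, can be sketched in Python. This is an illustration of the quoted position only and rests on strong simplifications: a single resonance with a Lorentzian-shaped gain curve, a centre frequency of 500 Hz and a bandwidth of 100 Hz are assumptions made for this example, and the function names resonance_gain and sampled_envelope are hypothetical.

```python
import math


def resonance_gain(freq_hz, center_hz, bandwidth_hz):
    """Gain of a single idealised resonance (Lorentzian shape, peak gain 1.0)."""
    half_bw = bandwidth_hz / 2.0
    return half_bw / math.hypot(freq_hz - center_hz, half_bw)


def sampled_envelope(f0_hz, center_hz=500.0, bandwidth_hz=100.0, limit_hz=1500.0):
    """Return (frequency, gain) pairs for the harmonics of f0 up to limit_hz.

    The envelope itself does not depend on f0; only the sampling points do.
    """
    n_max = int(limit_hz // f0_hz)
    return [
        (n * f0_hz, resonance_gain(n * f0_hz, center_hz, bandwidth_hz))
        for n in range(1, n_max + 1)
    ]


for f0 in (100.0, 250.0):
    points = ", ".join(f"{f:.0f} Hz: {g:.2f}" for f, g in sampled_envelope(f0))
    print(f"f0 = {f0:.0f} Hz -> {points}")
```

In this idealised model, a harmonic at 500 Hz receives the same gain whether it is the 5th harmonic of 100 Hz or the 2nd harmonic of 250 Hz. Whether vowel sounds actually behave in this way is a separate, empirical question.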


See also the citation of Hillenbrand (n.d.) in Chapter M6. 


“Undersampling” the formants I: formants at middle 
and high fundamental frequencies 


“According to the undersampling account of the effects of f0 on vowel 
identifiability, the sparser distribution of harmonics at high f0s yields 
poorer definition of the peaks and valleys in the spectral envelope, 
creating a more ambiguous stimulus.” (Diehl, Lindblom, Hoemeke, & 
Fahey, 1996) 


“However, in this range of frequency (500 to 1000 Hertz), you could 
not tell apart different vowels anyway, because the harmonics of the 
voice are so far apart that they are not ‘sampling’ the locations of the 
formants enough for you to tell where the formants lie. Therefore oper- 
atic writers only put words intended to be intelligible in the lower part 
of a soprano’s range.” (Moore, 2006, p. 11) 
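The sparseness described in the two quotations above can be made concrete with a little arithmetic. The following Python sketch simply counts how many harmonics of a given fundamental fall below an upper frequency limit. The limit of 3000 Hz and the function name are illustrative assumptions, not values taken from the cited sources.

```python
def harmonics_below(f0_hz, limit_hz=3000.0):
    """Return the harmonic frequencies k * f0 (k = 1, 2, ...) up to limit_hz."""
    count = int(limit_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


# The fundamentals below are example values spanning speech and soprano range.
for f0 in (100, 250, 500, 1000):
    harmonics = harmonics_below(f0)
    print(f"f0 = {f0:4d} Hz: {len(harmonics):2d} harmonics up to 3000 Hz, "
          f"spaced {f0} Hz apart")
```

Under these assumptions, a fundamental of 100 Hz places 30 harmonics below 3000 Hz, whereas fundamentals of 500 and 1000 Hz place only 6 and 3 harmonics there, which is the sense in which the spectral envelope is sampled more coarsely at high pitch.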


“Oversinging” the first formant 


“For the U it is also by no means easy to find the pitch of the resonance 
by a fork, as the smallness of the opening makes the resonance weak. 
Another phenomenon has guided me in this case. If I sing the scale
from c upwards, uttering the vowel U for each note, and taking care to
keep the quality of the vowel correct, and not allowing it to pass into
O, I feel the agitation of the air in the mouth, and even on the drums of
both ears, where it excites a tickling sensation, most powerfully when
the voice reaches f. As soon as f is passed the quality changes, the
strong agitation of the air in the mouth and the tickling in the ear cease.
[...] The resonance of the mouth for U is thus fixed at f with more
certainty than by means of tuning forks. But we often meet with a U of
higher resonance, more resembling O, which I will represent by the
French Ou. Its proper tone may rise as high as f’.” (von Helmholtz,
1885/1954, p. 110; c = 131 Hz, f = 175 Hz, f’ = 349 Hz) 


“Above f’, the characterization of U becomes imperfect even if it is 
closely assimilated to O. But so long as it remains the only vowel of in- 
determinate sound, and the remainder allow of sensible reinforcement 
of their upper partials in certain regions, this negative character will 
distinguish U. On the other hand a soprano voice in the neighbourhood
of f’’ should not be able to clearly distinguish U, O, A; and this
agrees with my own experience.” (von Helmholtz, 1885/1954, p. 114;
f’’ = 699 Hz)
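The frequency values given in parentheses after the two quotations of von Helmholtz can be checked with a short calculation. The following Python sketch converts lower-case Helmholtz pitch names into frequencies. It assumes twelve-tone equal temperament with a' = 440 Hz, which is a modern convention and not necessarily the tuning von Helmholtz had in mind; the function name and the use of the ASCII apostrophe for the prime sign are likewise assumptions made for illustration.

```python
# Semitone offsets of the natural notes above c (English note names,
# so "b" here means B natural).
SEMITONE = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11}


def helmholtz_to_hz(name, a4_hz=440.0):
    """Convert a lower-case Helmholtz pitch name such as "c", "f'" or "f''" to Hz.

    An unprimed lower-case letter lies in the octave below middle C
    (c = C3 in scientific notation); each prime raises it by one octave.
    """
    letter = name[0]
    primes = name.count("'")
    midi = 48 + SEMITONE[letter] + 12 * primes  # MIDI note 48 corresponds to c
    return a4_hz * 2.0 ** ((midi - 69) / 12.0)


for pitch in ("c", "f", "f'", "f''"):
    print(pitch, round(helmholtz_to_hz(pitch), 1))
```

With this tuning assumption the computed values agree with the values quoted above to within about 1 Hz.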


“It is reasonable to assume [...] that it is impossible to produce recog- 
nizable vowels at musical pitches very much higher than their first 
formants. [...] 

The following table is offered as a practical guide: Vowels start 
seriously losing intelligibility when the fundamental reaches these fre- 
quencies: 

(i u y) 350 cps (roughly middle F)
(e o ø) 450 cps (roughly middle A)
(ɛ ɔ œ) 600 cps (roughly high D)
(æ a ɑ) 750 cps (roughly high G)”
(Howie & Delattre, 1962) 
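The note names that accompany the frequency thresholds in the table above are explicitly approximate. As a rough check, the following Python sketch finds the nearest equal-tempered note for a given frequency. It assumes a reference of A4 = 440 Hz and scientific pitch notation (middle C = C4); neither assumption is stated by Howie and Delattre, and the function name is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(freq_hz, a4_hz=440.0):
    """Return (note name, frequency in Hz) of the nearest equal-tempered note."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    name = f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"
    return name, a4_hz * 2.0 ** ((midi - 69) / 12.0)


# The four threshold frequencies (in cps) given in the table.
for threshold in (350, 450, 600, 750):
    name, note_hz = nearest_note(threshold)
    print(f"{threshold} cps is nearest to {name} ({note_hz:.0f} Hz)")
```

Under these assumptions, 350, 450 and 600 cps round to F4, A4 and D5, while 750 cps lies between F#5 (about 740 Hz) and G5 (about 784 Hz) and rounds to F#5, so that "roughly high G" is indeed only an approximation.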


“[...] only very few correct identifications of isolated vowels can be
expected when fundamental frequency reaches or exceeds the usual 
first formant of a vowel.” (Hollien, Mendes-Schwartz, & Nielsen, 2000) 


“[...] vowel identifiability is inevitably compromised once f0 exceeds R1
[...]” (Joliveau, Smith, & Wolfe, 2004)


“We have seen that female singers gain considerably in sound level by 
abandoning the formant frequencies typical of normal speech when 
they sing at high pitches. At the same time, F1 and F2 are decisive to 
vowel quality. This leads to the question of how it is possible to under- 
stand the lyrics of a song when it is performed with the ‘wrong’ F1 and 
F2 values. Both vowel intelligibility and syllable/text intelligibility can be 
expected to be disturbed. This aspect of singing has been studied in 
several investigations. 

As a thought-provoking reminder of the difficulties in arranging 
well-controlled experimental conditions in the past, an experiment 
carried out by the German phonetician Carl Stumpf (1926) may be 
mentioned. He used three singer subjects: a professional opera singer 
and two amateur singers. Each singer sang various vowels at different 
pitches, with their backs turned away from a group of listeners who 
tried to identify the vowels. The vowels that were sung by the profes- 
sional singer were easier to identify. Also, overall, the percentages of 
correct identifications dropped as low as 50% for several vowels sung 
at the pitch of G5 (784 Hz). 

Since then, many investigations have been devoted to intelligibil- 
ity of sung vowels and syllables (see, e.g. Benolken & Swanson, 1990; 
Gregg & Scherer, 2006; Morozov, 1965). Figure 12 gives an overview of 
the results in terms of the highest percentage of correct identifications 
observed in various investigations for the indicated vowels at the indi- 
cated pitches. The graph shows that vowel intelligibility is reasonably 
accurate up to about C5 and then quickly drops with pitch to about 
15% correct identification at the pitch of F5. The only vowel that has 
been observed to be correctly identified more frequently above this 
pitch is /a/. Apart from pitch and register, larynx position also seems 
to affect vowel intelligibility (Gottfried and Chew, 1986; Scotto di Carlo 
and Germain, 1985). 

Smith and Scott (1980) strikingly demonstrated the significance 
of consonants preceding and following a vowel. This is illustrated in 
the same graph. Above the pitch of F5, syllable intelligibility is clearly 
better than vowel intelligibility. Thus, vowels are easier to identify when 
the acoustic signal contains some transitions (Andreas, 2006). Inci- 
dentally, this seems to be a perceptual universal: changing stimuli are 
easier to process than are quasi-stationary stimuli.

The difficulties in identifying vowels and syllables sung at high pitches
would result both from singers’ deviations from the formant frequency 
patterns of normal speech and from the fact that high-pitched vow- 
els contain few partials that are widely distributed over the frequency 
scale, producing a lack of spectral information. 

In addition, a third effect may contribute. Depending on phonation
type, the F0 varies in amplitude. At a high pitch, F1 may lie between
the first and the second partial. Sundberg and Gauffin (1982) presented
synthesized, sustained vowel sounds in the soprano range and asked
subjects to identify the vowel. The results showed that an increased
amplitude of the F0 was generally interpreted as a drop in F1.” (Sundberg,
2013, pp. 86-88)


“Grade” of vowels 


As discussed in Sections 4.1 and 4.2, prevailing theory gives reason 
to assume that a general but also discontinuous relationship exists 
between the intelligibility of vowel sounds and their fundamental fre- 
quency: accordingly, vowel sounds at lower fundamental frequencies 
would, as a rule, be more intelligible than vowel sounds at higher fre- 
quencies, but vowel intelligibility would also depend upon the respec- 
tive relationships between fundamental frequency, harmonic spectrum 
and the vowel-specific formant pattern (as given in formant statistics). 


Concerning the former, consider the following model cases: 


- Comparison of two sounds of /e/ produced by a woman at F0
of 200 and 400 Hz, related to a common formant pattern F1-F2
= 600-2000 Hz (compare Section 2.2, the formant statistics for
Standard German); F1 will be “undersampled” for the sound at
higher F0, i.e. F1 lying in between the first and the second
harmonics, whereas for the first sound, the third harmonic matches
with F1 indicating a “sampled” formant pattern F1-F2 as a better
condition for vowel perception.

- Comparison of two sounds of /ɔ/ produced by a woman at F0
of 285 and 340 Hz, related to a common formant pattern F1-F2
= 570-1140 Hz (compare Section 2.2, the formant statistics for
Standard German); F1-F2 will be “undersampled” for the sound
at higher F0, i.e. F1 lying in between the first and the second,
and F2 lying in between the third and the fourth harmonics, while
for the first sound, the second and the fourth harmonics match
with F1 and F2.

- And so on.


Concerning the latter, consider the following model cases:


- Comparison of two sounds of /i/ produced by a woman at F0
of 200 and 300 Hz, related to a common formant pattern F1-F2
= 300-2700 Hz (compare Section 2.1, the formant statistics of
Peterson and Barney, 1952); F1 and F2 will be “undersampled”
for the sound at lower F0, with F1 lying in between the first and
the second, and F2 lying in between the thirteenth and the
fourteenth harmonics, while for the second sound, the first and the
ninth harmonics match with F1 and F2 indicating a “sampled”
formant pattern F1-F2 as a better condition for vowel perception.

- Comparison of two sounds of /a/ produced by a woman at F0
of 270 and 330 Hz, related to a common formant pattern F1-F2
= 660-990 Hz (compare Section 2.1, the formant statistics of
Fant, 1959); F1 and F2 will be “undersampled” for the sound at
lower F0, i.e. F1 lying in between the second and the third, and
F2 lying in between the third and the fourth harmonics, while for
the second sound, the second and the third harmonics match
with F1 and F2.

- Comparison of two sounds of /u/ produced by a woman at F0
of 200 and 300 Hz, related to a common formant pattern F1-F2
= 300-900 Hz; F1 and F2 will be “undersampled” for the sound
at lower F0, i.e. F1 lying in between the first and the second, and
F2 lying in between the fourth and the fifth harmonics, while for
the second sound, the first and the third harmonics match with
F1 and F2.

- And so on.
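The harmonic bookkeeping in model cases of this kind can be reproduced mechanically. The following Python sketch is an illustration only: for a given fundamental frequency it reports whether a formant frequency coincides with a harmonic or lies between two neighbouring harmonics. The function name and the exact-match criterion (a tolerance of 0 Hz by default) are assumptions; treating a formant as "sampled" only when a harmonic coincides with it exactly is a simplification of the notion discussed in the text.

```python
def locate_formant(f0_hz, formant_hz, tolerance_hz=0.0):
    """Describe where a formant frequency lies relative to the harmonics of f0.

    Returns ("match", k) if the k-th harmonic coincides with the formant
    (within tolerance_hz); otherwise ("between", k, k + 1) for the two
    neighbouring harmonic numbers (k = 0 would mean "below the fundamental").
    """
    k = round(formant_hz / f0_hz)
    if k >= 1 and abs(k * f0_hz - formant_hz) <= tolerance_hz:
        return ("match", k)
    lower = int(formant_hz // f0_hz)
    return ("between", lower, lower + 1)


# (formant pattern label, pair of F0 values, (F1, F2)) from three of the model cases.
MODEL_CASES = [
    ("F1-F2 = 570-1140 Hz", (285, 340), (570, 1140)),
    ("F1-F2 = 660-990 Hz", (270, 330), (660, 990)),
    ("F1-F2 = 300-900 Hz", (200, 300), (300, 900)),
]

for label, f0_pair, formants in MODEL_CASES:
    for f0 in f0_pair:
        placement = [locate_formant(f0, f) for f in formants]
        print(f"{label}, F0 = {f0} Hz: {placement}")
```

A non-zero tolerance_hz could be passed to treat near-coincidences as matches; which tolerance would be perceptually meaningful is not addressed here.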


“Undersampling” the formants II: resonances and formants 


If a basic distinction is made between the resonances of the vocal tract 
and the formants of the vowel sound produced, strictly speaking, only 
resonances can be undersampled in the sense of a large frequency 
distance between harmonics and no harmonic matching an existing 
resonance frequency. Formants in their turn are always a result of a 
method of measurement.


M5 Formant Patterns and Speaker Groups


Thesis of age- and gender-related differences in vowel-specific 
formant patterns


“Because of shorter cavity lengths females [...] have larger average 
formant spacings and higher average formant frequencies than males. 
Similar relations hold for children compared with adults [...].” (Fant, 
1960, p. 21) 
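The dependence of formant frequencies and formant spacing on cavity length can be illustrated with the standard textbook idealization of the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances lie at odd multiples of c/4L. The following Python sketch uses this idealization only; the speed of sound of 350 m/s and the two tube lengths of 17.5 cm and 14.5 cm are illustrative assumptions and are not taken from Fant's text.

```python
def tube_resonances(length_m, n=3, speed_of_sound=350.0):
    """First n resonances (Hz) of a uniform tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 * L), k = 1, 2, ...
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m) for k in range(1, n + 1)]


# Two assumed tube lengths, a longer and a shorter one.
for length in (0.175, 0.145):
    values = ", ".join(f"{f:.0f}" for f in tube_resonances(length))
    print(f"L = {length * 100:.1f} cm: {values} Hz")
```

Under these assumptions the longer tube yields resonances near 500, 1500 and 2500 Hz and the shorter tube near 603, 1810 and 3017 Hz, that is, higher frequencies and a larger spacing for the shorter tube. Real vocal tracts are not uniform tubes, so the sketch shows only the direction of the effect.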


“Men, women, and children generally differ with respect to average 
vocal tract length, which is significant for the formant frequencies, as 
we know. For this reason, the same vowel is usually represented by 
different formant frequencies in men, women, and children. 

[...] average formant frequency differences between male and 
female adults are expressed as the percentages by which the three 
lowest formant frequencies of a given vowel in female adults exceed 
those in male adults (Fant, 1975). [...] they vary considerably between 
vowels, particularly for the lowest two formants. [...] these percentage 
differences occur similarly in various languages. The first formant fre- 
quency shows a maximum percentage difference in the open /a:/ vow- 
el of the Italian word caro. The second formant frequency shows high 
values for all front vowels. The difference, averaged over the entire 
set of vowels, amounts to 12%, 17%, and 18% for the three lowest 
formants. Children’s average formant frequencies are about 20% high- 
er than those for female adults, or 32%, 37%, and 38% higher than 
those of male adults. Probably most of these differences are due to 
inequalities in the vocal tract dimensions between the various groups 
of speakers. Thus, younger children tend to have higher formant fre- 
quencies than older children because of their shorter vocal tracts. 

If the proportions of the average female and male vocal tracts 
are compared, one finds that the female vocal tract is not merely a 
small-scale version of the male vocal tract. According to Nordstrom 
(1977), the average mouth length of a female adult is about 85% of that 
of the average male adult, while the female pharynx length is only 77% 
of the corresponding male value. In other words, the average female 
pharynx is much shorter than the average male pharynx, while the av- 
erage difference is smaller with regard to the mouth. 

If one computes the formant frequency differences that would re- 
sult from these dissimilarities in the mouth and pharynx proportions be- 
tween adult males and females, one finds a discrepancy between predic- 
tion and reality; the differences that have been found in the dimensions 
do not explain the actual formant frequency differences, according to
Nordstrom (1977). The reason for this is not well understood. The existence
of sex dialects, or ‘sexolects’, cannot be excluded; it is possible
that females and males use a slightly different articulation of some
vowels. The reason may be hidden in the largely unknown processes 
used by our sense of hearing and our brain in order to identify vowels. 

We correctly infer that the actual reasons for the formant fre- 
quency differences between children and adult males and females are 
not understood in every detail. However, it is also interesting to see 
to what extent the voice timbre differences between these groups of 
speakers can be accounted for by the formant frequency differences. 
Coleman (1976) has published an interesting investigation on this topic.
In an experiment in which subjects tried to identify the sex of speakers 
by listening to the voice quality, he found that phonation frequency was 
a much more important factor than formant frequencies as illustrated 
in Figure 5.10; the average of the three lowest formant frequencies 
showed little or no correlation with maleness and femaleness in voice 
timbre. The faint trace of a correlation that appears to exist between 
the average of the three lowest formant frequencies and the perceived 
maleness or femaleness was due to an equally low correlation between 
phonation frequency and this formant frequency average. 

It may be important to these results that the three lowest formant 
frequencies were not separated but were converted into an average in 
this investigation. It is not clear whether such an average catches all of 
the timbral voice differences between the sexes, and it is also possible 
that the results would have come out differently if the fourth formant 
had been included in the average; the higher the formant frequency, 
the more its frequency depends on nonarticulatory factors such as vo- 
cal tract length. 

It seems clear that the perceptually most important difference in 
voice quality between the two sexes depends on phonation frequency 
rather than formant frequencies. The mean phonation frequency dif- 
ference is almost one octave, which is much greater than the formant 
frequency difference. We realize that our brain is quite smart: it is more 
impressed by the great phonation frequency difference than by the 
small formant frequency difference when guessing the sex of a speaker.” 
(Sundberg, 1978) 


Concerning indications of similar formant patterns for sounds of dif- 
ferent vowels produced by speakers of different speaker groups, see, 
for example, the vowel synthesis experiment in Potter and Steinberg 
(1950), and the [e]-[ø] ambiguity reported by Fant, Carlson, and Granström
(1974). See also the indications of similar F1-F2 for /U/ and
/u/, and for /A/ and /o/ in the statistics of Hillenbrand et al. (1995),
comparing the patterns of women and men, and of children and men,
respectively. 


Questioning this thesis: von Helmholtz (1885), 
Potter and Steinberg (1950) 


“[...] the proper tones of the cavity of the mouth are nearly independent
of age and sex. I have in general found the same resonances in men,
women, and children. The want of space in the oral cavity of women 
and children can be easily replaced by a great closure of the opening, 
which will make the resonance as deep as in the larger oral cavities of 
men.” (von Helmholtz, 1885/1954, p. 105) 


Note that this statement by von Helmholtz stands in contradiction to 
his self-experiment, on the basis of which he concluded a vowel-specific
resonance for U at 175 Hz (see Chapter M2): particularly for the
speech of children, the fundamental frequency is substantially above 
175 Hz, not allowing for a production of U, if vowel-specific resonances 
are independent of age and gender. 


“Audible Form and Vowel Identification: Form or pattern of the formant 
positions appears to be important in discriminating between sounds. 
One of the first results found was that, for a given vowel sound, the 
actual formant frequency positions for a man’s voice differ markedly 
from those for a woman’s or a child’s voice. To illustrate this difference 
the frequencies of the formants in the vowel sound [æ] as spoken by
a man, a woman and a child are shown on the left hand side of Fig.
5 by short horizontal lines designated F1, F2, F3. [...] Listening tests 
indicate that these three sounds are identified as the same vowel. Yet 
the values of the formant frequencies are quite different. Certainly we 
cannot regard a vowel as completely specified by fixed regions of en- 
ergy concentration. [...] 

If we view the formant positions in relation to positions of fun- 
damental frequency, they fall into better alignment. This suggests that 
the fundamental frequency of the voiced sounds might offer a means 
for normalizing the formant positions. However, this seems a dubious 
possibility because the formant positions for a given vowel are prob- 
ably directly related to the dimensions of the vocal cavities and only 
incidentally related to fundamental frequency. For example, whispered 
vowels can be identified readily. Also there may well be cases of high 
fundamental frequency with large vocal cavities, and vice versa, that 
would need to be considered.

To obtain preliminary information on the question of how pitch affects
vowel identification we have synthesized sounds having the same 
formant outlines but different fundamental frequencies. One such case 
is illustrated in Fig. 6. The two upper charts show the spectra for the [æ]
(had) sounds of Fig. 5, for the adult male and child’s voices. The fundamental
frequencies are 109 and 264 cycles respectively. The lower
chart shows an unnatural spectrum, namely, the adult male’s formant 
outline with a fundamental frequency of 256 cycles, approximating that 
of the child’s voice. This frequency was chosen so that the peaks of 
the formants would not be shifted markedly in position. Sounds corre- 
sponding to the three spectra were synthesized by means of a spec- 
trum generator [...]. 

The first two synthesized sounds were readily identified by ear 
as [æ] sounds. The third sound, however, was neither the man’s nor
the child’s [æ]. It seemed to be somewhere between the child’s [æ]
and [e]. This phonetic shift may indicate an association between fun- 
damental frequency and formant position. But the shift could also arise 
if the ear assigns different pitch centers or positions to the energy con- 
centrations representing the formants in the upper and lower cases. 

The effects become more pronounced when the back vowels are 
used in such a comparison. Figure 7 shows spectra similar to the ones 
in Fig. 6, except that they are for the [a] (father) sound. 

In this case, the first two sounds were clear [a’s]. The third 
sound was more like a child’s [ɔ] (awl) than the [a] (father). Here there
is also a question of association or actual shift in the ear’s assignment 
of formant position. Still if one considers the bar positions of these 
sounds as illustrated in Fig. 8, there is some support for an association 
of fundamental frequency and formant position. [...] We have seen that 
an increase in fundamental frequency seems to require that both bars 
be raised in frequency position to maintain the identification of a given 
vowel (Fig. 5). Hence, in the case of the [a] sound, the combination of 
adult formants with the child’s fundamental frequency shifts the sound 
toward the [ɔ]. It must be admitted, though, that the association of adult
formants and child’s fundamental frequency is an unnatural one giving 
sounds that do not correspond to any of the natural sounds.” (Potter 
& Steinberg, 1950) 
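The synthesis condition described in this quotation, one formant outline combined with two different fundamental frequencies, can be imitated numerically in a very reduced form. The following Python sketch samples a fixed toy envelope at the harmonics of 109 and 256 cycles and reports, for each formant, the nearest harmonic and its level relative to the envelope at the formant frequency. It does not reproduce the spectrum generator or the measured formant outline of Potter and Steinberg: the three formant frequencies, the bandwidth of 100 Hz, the shape of the envelope (treated as a power envelope) and all names are placeholders assumed for illustration.

```python
import math

FORMANTS_HZ = (660.0, 1720.0, 2410.0)  # placeholder formant outline (assumed)
BANDWIDTH_HZ = 100.0                    # assumed, identical for all peaks


def envelope(freq_hz):
    """Toy power envelope: sum of resonance-shaped peaks, each of height 1."""
    half_bw = BANDWIDTH_HZ / 2.0
    return sum(half_bw ** 2 / ((freq_hz - f) ** 2 + half_bw ** 2)
               for f in FORMANTS_HZ)


def nearest_harmonics(f0_hz):
    """For each formant: (nearest harmonic in Hz, level in dB re. envelope at the formant)."""
    result = []
    for formant in FORMANTS_HZ:
        k = max(1, round(formant / f0_hz))
        harmonic = k * f0_hz
        level_db = 10.0 * math.log10(envelope(harmonic) / envelope(formant))
        result.append((harmonic, round(level_db, 1)))
    return result


for f0 in (109, 256):
    print(f"F0 = {f0} cycles: {nearest_harmonics(f0)}")
```

In this toy model the harmonics of the lower fundamental fall close to the envelope peaks, whereas those of the higher fundamental can lie some 70 to 110 cycles away from them and correspondingly lower in level. Whether such displacements account for the phonetic shifts reported in the quotation is left open here, as it is by the authors.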


Exceptions in existing formant statistics 


Although in formant statistics, the highest frequency values of vow- 
el-specific formants are generally given for children, middle values for 
women and the lowest values for men, exceptions can be found. Some 
examples of such exceptions are listed below, ordered according to
vowel quality. Abbreviations used are: “*” = values for the comparison
of voiced vowel sounds; “**” = values for the comparison of whispered
vowel sounds; “SinSp” = values for the comparison of the sounds of a 
single male and a single female speaker as given in Fant (1959); “Av” 
= average values for a speaker group in the statistics of Fant (1959). 
Examples of single formants or formant patterns for which higher fre- 
quency values are given for men than for women: 


/i/  F1*-F2*-F3* (Fant, 1959, SinSp); F1* (Fant, 1959, Av); F1* (compare
Pols, Tromp, & Plomp, 1973; van Nierop, Pols, & Plomp, 1973)

/y/  F1* (Fant, 1959, SinSp; marginal difference for F2*); F1* (compare
Pols et al., 1973; van Nierop et al., 1973)

/e/  F1*-F2* (Fant, 1959, SinSp)

/e/  F1*-F2*-F3* (Fant, 1959, SinSp); F1* (Fant, 1959, Av)

/e/  F2* (Fant, 1959, SinSp)

/æ/  F2* (Fant, 1959, Av); F2** (Sharifzadeh, McLoughlin, & Russell, 2012)

/o/  F1** (Sharifzadeh et al., 2012; marginal difference F2**; marginal
differences also for F1*-F2*)

/o/  F1* (Fant, 1959, SinSp)

/o/  F1*-F2* (Fant, 1959, SinSp); F1* (Fant, 1959, Av); F1*, F1**-F2**
(Sharifzadeh et al., 2012)

/u/  F1* (Fant, 1959, SinSp); F2* (Fant, 1959, Av); F1* (compare Pols
et al., 1973; van Nierop et al., 1973); F1* (Zee, 2003); F1** (Sharifzadeh
et al., 2012)


See also Hillenbrand et al. (1995) for slightly higher F1 values of /A/ for 
women than for children. 


“We have argued [...] that for the vowels /u/, /i/ and /y/ as well, F1 can 
be chosen so that its average value is higher for female speakers than 
for male speakers. However, F1 then becomes about equal to 2×F0
(490 Hz) which is much too high. The data on the vowels /u/, /i/ and 
/y/ do not confirm the usual upward shift of formant frequencies for 
female speakers. We do not suggest that the anomaly for these three 
vowels reflects the actual resonance frequencies of the vocal tract.” 
(van Nierop et al., 1973) 


Zee (2003) found lower F1 for women than for men for the vowel /u/ 
when investigating formant frequencies of Cantonese vowels and comments
on his finding as follows: “In any case, it is not clear as to why the
F1 value for [u] does not follow the general pattern.”


“In looking at the ranges for each vowel formant frequency for the male
and female groups, the overlap between genders was considerable. In 
all cases, the highest formant value for the male group was markedly 
above the lowest formant value for the female group for each formant 
of both vowels. This would suggest that in some individual cases, the 
formants of a male speaker might be the same as, or even higher than, 
the formants of a female speaker.” (Gelfer & Bennett, 2013)


M6 Terms of Reference, Methods of Formant
Estimation 


Terms of reference 


“Formant [...]. A concentration of acoustic energy within a particular 
frequency band, especially in speech. Any given configuration of the 
vocal tract produces resonance, and hence formants, in certain fre- 
quency ranges. During the articulation of a vowel, these formants show 
up prominently in a sound spectrogram as thick dark bars; the three 
lowest of these, known as first, second and third formants (F1, F2 and 
F3) are highly diagnostic, and vowels are distinguished acoustically by 
the positions of these formants.” (Trask, 1996, p. 148) 


“Some refer to a formant as a peak in the acoustic spectrum. In this us- 
age, a formant is an acoustic feature that may or may not be evidence 
of a vocal tract resonance. Others use the term formant to designate 
a resonance, whether or not actual empirical evidence is found for it.” 
(Kent & Read, 2002, p. 24) 


“Resonances, formants and spectral peaks: Unfortunately, the mean- 
ing of the word ‘formant’ has expanded to describe two or three differ- 
ent things. Fant (1960) gives this definition: ‘The spectral peaks of the 
sound spectrum |P(f)| are called formants.’ Resonance frequencies 
are then defined in terms of the gain function T(f) of the tract by ‘The 
frequency location of a maximum in | T(f)|, i.e. the resonance frequency, 
is very close to the corresponding maximum in spectrum | P(f)| of the 
complete sound.’ Fant then writes: ‘Conceptually these should be held 
apart but in most instances resonance frequency and formant frequen- 
cy may be used synonymously.’ Benade (1976) uses a similar definition 
of formant: ‘The peaks that are observed in the spectrum envelope are 
called formants.’ More recently, the acoustical properties of the vocal 
tract are often modelled using an all-pole autoregressive filter (Atal and 
Hanauer, 1971). For many voice researchers, formants now refer to 
the poles of this filter model. To others, formant means the resonance 
frequency of the tract. Finally, many researchers, particularly in the 
broader field of acoustics, retain the original meaning: a broad peak in 
the spectral envelope of a sound (of a voice, musical instrument, room 
etc.). The original meaning of formant is also retained, almost univer- 
sally, when discussing the singers formant and actors formant: these 
terms refer to a peak in the spectral envelope around 3kHz (discussed 
below). As Fant observes, while these uses are often closely related, 
they are conceptually quite distinct. Further, the resonant frequency,
the pole of the fitted filter function and the peak spectral maximum
need not coincide. Moreover, it is now possible to measure resonances 
of the vocal tract quite independently of the voice. Consequently, it is 
sometimes essential to make a clear distinction among a resonance 
frequency (a physical property of the tract), a filter pole (a value derived 
from data processing) and a spectral peak (a property of the sound).” 
(Wolfe, Garnier, & Smith, 2009) 


“Formant is used by James Jeans (1938) to mean the collection of 
harmonics of a note that are augmented by a resonance. 
Formant was defined by Gunnar Fant (1960): ‘The spectral peaks of 
the sound spectrum |P(f)| are called formants’. 
Benade (1976) writes: ‘The peaks that are observed in the spectrum 
envelope are called formants’. 

In its standards for acoustical terminology, the Acoustical Society
of America (1994) defines formant thus: ‘Of a complex sound, a
range of frequencies in which there is an absolute or relative maximum
in the sound spectrum. Unit, hertz (Hz). NOTE-The frequency
at the maximum is the formant frequency.’” (Wolfe, n.d.)


“Does it matter? For the voice, a resonance at a frequency R(i) gives 
rise to a spectral maximum at frequency F(i) which may produce in 
a filter model a pole at frequency P(i). Usually, the three frequencies 
have similar values. However, as Fant observed, they are conceptually 
distinct. Let’s take some examples: 

- Consider a vocal tract with a resonance at 500 Hz, which is be- 
ing excited by the larynx producing a fundamental frequency of 
1 kHz (near C6, the high C for sopranos). There is no spectral 
maximum at 500 Hz. In this case there is a resonance R1 but no 
corresponding spectral peak F1. Here of course the difference 
does matter. 

- Consider the singers formant or singing formant, a broad band
of enhanced power noticed in the spectral envelope of classical- 
ly trained male singers (and possible others) in a range. Sund- 
berg (1974) attributes this formant to a clustering of the third, 
fourth and fifth resonances of the vocal tract. Here, where three 
resonances are thought to give rise to one formant, the distinc- 
tion between formant and resonance is important. 

- Consider a glottal source with a negative spectral slope, input
to a vocal tract that (including radiation impedance) has a reso- 
nance at R1. The peak in the spectral envelope of the radiated 
sound in this case has a frequency less than R1. In this case, if 
one is estimating the spectral peak from the harmonic spectrum
of the output voice, the difference between the two is less than
the precision of the estimation, so the distinction is usually not 
important. 

- Consider a musical wind instrument, whose bore radiates weak- 
ly below some frequency f, and which is excited by a reed or 
lip valve whose spectral envelope falls with frequency. Here the 
output sound has a spectral envelope peak that has nothing at 
all to do with the resonances of the bore. 

- Consider this quote, from Stevens and House (1961): ‘When
resonant frequencies are sufficiently close, however, they are 
not necessarily identical with the frequencies of the peaks in the 
spectrum. For example, when two resonances with bandwidths 
of about 100 cps are about 100 cps apart, the spectrum enve- 
lope may show only one prominence: the frequency of the peak 
will be somewhere between the two resonant frequencies. In the 
discussion that follows, the levels of the resonances will be de- 
fined to be the levels of the spectral envelope at the frequencies 
of the resonances (rather than at the spectral peaks).’ 


In our laboratory, the distinction is important. We routinely measure the 
resonances independently of the voice (Epps et al, 1997; Dowd et al, 
1997; Joliveau et al, 2004a, b). We are often interested in comparing 
formants and resonances. 

What to do? Our preference would be to retain the original mean- 
ing for the word formant. We prefer to say ‘A resonance at frequency 
Ri gives rise to a formant at frequency Fi. This may be modelled by 
a filter with a pole at frequency Pi’. While acousticians will broadly 
agree with this use, some members of the speech research and mod- 
elling community may not. We therefore suggest that, when discussing 
the voice, the word formant should be defined, to make it clear which 
meaning is intended. In principle, one could consider abandoning the 
word. However ‘broad peak in the spectral envelope’ is a long phrase, 
so it is useful to retain formant for that reason. 

[...]

Whatever your choice of definition, you should make it clear. And, in 
literature and in discussions, prepare for some confusion. For instance, 
some researchers who use formant to mean resonance will also talk 
about ‘formant level’. When such people then talk of ‘formant level’, 
or say that the second formant is 10dB lower than the first, I suspect
that they refer to the amplitude of a peak in the sound spectrum. In a
scientific talk, I have heard the sentence: ‘Trained sopranos tune the
first formant near the note sung, but they usually don’t have a strong 
singer’s formant’. When that speaker said ‘first formant’ he presumably
meant ‘first resonance’ and when he said ‘singer’s formant’ he meant
a spectral peak probably due to two or more resonances. So we have 
the same person using the word in two of its three different meanings 
in the one sentence.” (Wolfe, n.d.) 
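Two of the examples given in the quotation above, the resonance at 500 Hz excited at a fundamental frequency of 1 kHz and the two neighbouring resonances cited from Stevens and House (1961), lend themselves to a small numerical illustration. The following Python sketch is a simplification under stated assumptions: each resonance is approximated by a Lorentzian-shaped gain curve, several resonances are combined in cascade (as a product), and the resonance frequencies of 1000 and 1100 Hz, the search range, the tolerance and all function names are chosen for illustration only. None of this is specified in the quoted texts.

```python
def harmonic_near(resonance_hz, f0_hz, tolerance_hz=50.0):
    """True if some harmonic of f0 lies within tolerance_hz of the resonance."""
    k = max(1, round(resonance_hz / f0_hz))
    return abs(k * f0_hz - resonance_hz) <= tolerance_hz


def cascade_gain(freq_hz, resonances_hz, bandwidth_hz=100.0):
    """Power gain of resonances in cascade, each approximated by a Lorentzian."""
    half_bw = bandwidth_hz / 2.0
    gain = 1.0
    for r in resonances_hz:
        gain *= half_bw ** 2 / ((freq_hz - r) ** 2 + half_bw ** 2)
    return gain


def envelope_peaks(resonances_hz, lo_hz=0, hi_hz=3000, bandwidth_hz=100.0):
    """Frequencies (on a 1 Hz grid) at which the cascade gain has a local maximum."""
    gains = [cascade_gain(f, resonances_hz, bandwidth_hz)
             for f in range(lo_hz, hi_hz + 1)]
    return [lo_hz + i for i in range(1, len(gains) - 1)
            if gains[i - 1] < gains[i] > gains[i + 1]]


# Resonance at 500 Hz: sampled at F0 = 250 Hz, but not at F0 = 1000 Hz.
print(harmonic_near(500, 250), harmonic_near(500, 1000))

# Two resonances of 100 Hz bandwidth, 100 Hz apart and 300 Hz apart.
print(envelope_peaks([1000, 1100]))
print(envelope_peaks([1000, 1300]))
```

In this simplified model, two resonances 100 Hz apart produce a single envelope peak lying between them, while two resonances 300 Hz apart produce two separate peaks. The number of resonances and the number of spectral envelope peaks therefore need not agree, which is the distinction the quotation insists on.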


“With regard to airway resonances, historical precedence and current 
usage of terminology are also slightly at odds. Joe Wolfe and col- 
leagues suggest that the symbol R be used to stand separate from 
the symbol F for formant (Wolfe, 2014). The distinction is being made 
because a formant was originally defined as a peak in the output spec- 
trum envelope radiated from the mouth (Hermann, 1894, 1895; Rus- 
sell, 1929; Fant, 1960, p. 20). A similar definition appears in the current 
ASA standard of acoustic terminology (Acoustical Society of America, 
2004), namely, that a formant is ‘a range of frequencies in which there 
is absolute or relative maximum in the sound spectrum. The frequency 
at the maximum is the formant frequency.’ As such, a formant involves 
both the source and the filter. However, as speech analysis and syn- 
thesis have progressed in a half century, the definition has not been 
universally maintained. Fant (1960, pp. 20, 53) defined formants as 
the poles of the transfer function of the supraglottal vocal tract, and 
labeled the pole frequencies F1, ..., Fn and their bandwidths B1, ...,
Bn. He was followed in this path by many authors, such as Titze (1994,
p. 156) or Stevens (1998, p.131). It is noteworthy that Flanagan (1965, 
p. 57) was aware of the dual definition (and possible evolution) by us- 
ing the term ‘formant resonance.’ While Benade (1976) maintained the 
definition of ‘peaks in the spectral envelope of the radiated sound,’ 
Badin and Fant (1984) computed formant frequencies and bandwidths 
on the basis of x-ray area function resonances of the supraglottal vocal 
tract, not peaks in the output spectrum envelope. Story et al. (1996) 
did similar calculations based on magnetic resonance imaging (MRI). 
Differentiation between the formant frequencies and resonance fre- 
quencies of the vocal tract can be found in some papers comparing 
measurements from phonation (formants) to those derived from vocal 
tract impedance measurements or from calculations based on MRI or 
computer tomography (CT) data (resonance frequencies) (e.g., Stoffers 
et al., 2006; Vampola et al., 2013). 

What is relevant here for nomenclature and symbolic notation 
is that the letter R is easily distinguishable from the letter F or f, both 
in speaking and writing. Hence, it is useful as a subscript to separate 
source and filter symbols. Discussion can continue on whether or not 
a formant is a meaningful representation of any particular resonance. 
Some authors describe resonances pertaining to the supraglottal airway
only (assuming no coupling to the glottal or subglottal system),
while others describe the net effect of complex interactions of multiple
resonators above, below, and within the larynx. [...] 

Unfortunately, the common definition between a formant and a 
resonance is yet to be established.” (Titze et al., 2015) 


Note that Titze et al. (2015) propose a new and consistent terminology 
for the frequencies, magnitudes and bandwidths of harmonics, reso- 
nances and formants. 


“Spectrum Envelope: The term spectrum envelope refers to an im- 
aginary smooth line drawn to enclose an amplitude spectrum. Figure 
3-17 shows several examples. This is a rather simple concept that will 
play a very important role in understanding certain aspects of auditory 
perception. For example, we will see that our perception of a percep- 
tual attribute called timbre (also called sound quality) is controlled 
primarily by the shape of the spectrum envelope, and not by the fine 
details of the amplitude spectrum. The examples in Figure 3-17 show 
how differences in spectrum envelope play a role in signaling differenc- 
es in one specific example of timbre called vowel quality (i.e., whether 
a vowel sounds like /i/ vs. /a/ vs. /u/, etc.). For example, panels a and 
b in Figure 3-17 show the vowel /a/ produced at two different funda- 
mental frequencies. (We know that the fundamental frequencies are 
different because one spectrum shows wide harmonic spacing and the 
other shows narrow harmonic spacing.) The fact that the two vowels 
are heard as /a/ despite the difference in fundamental frequency can 
be attributed to the fact that these two signals have similar spectrum 
envelopes. Panels c and d in Figure 3-17 show the spectra of two 
signals with different spectrum envelopes but the same fundamental 
frequency (i.e., with the same harmonic spacing). As we will see in the 
chapter on auditory perception, differences in fundamental frequency 
are perceived as differences in pitch. So, for signals (a) and (b) in Figure 
3-17, the listener will hear the same vowel produced at two different 
pitches. Conversely, for signals (c) and (d) in Figure 3-17, the listener 
will hear two different vowels produced at the same pitch.” (Hillen- 
brand, n.d., pp. 16-17) 
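
The relation between spectrum envelope and harmonic spacing described in this quotation can be illustrated with a minimal Python sketch. It is not taken from Hillenbrand: the envelope shape, the peak values and the names `example_envelope_db` and `harmonic_spectrum` are assumptions chosen purely for illustration. The same envelope is sampled at the harmonics of two different fundamental frequencies.

```python
def example_envelope_db(frequency_hz):
    """Hypothetical smooth spectrum envelope in dB with three broad peaks."""
    # (centre in Hz, width in Hz, level in dB); illustrative values only
    peaks = [(700, 300, 60.0), (1200, 350, 55.0), (2600, 500, 45.0)]
    return max(level - 20.0 * ((frequency_hz - centre) / width) ** 2
               for centre, width, level in peaks)


def harmonic_spectrum(f0_hz, envelope, max_hz=5000):
    """Sample an envelope at the harmonics of f0_hz up to max_hz."""
    count = int(max_hz // f0_hz)
    return [(k * f0_hz, envelope(k * f0_hz)) for k in range(1, count + 1)]


# Same envelope, two fundamental frequencies: narrow vs wide harmonic spacing.
low = harmonic_spectrum(100, example_envelope_db)
high = harmonic_spectrum(400, example_envelope_db)
print(len(low), len(high))  # 50 harmonics vs 12 harmonics below 5 kHz
```

In this idealisation, both line spectra lie on the same envelope, but the spectrum with the higher fundamental frequency describes that envelope with far fewer points.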


Methods of formant estimation I: general aspects 


“The difficulties involved in measuring formant frequencies have been 
well known since the early days of the spectrograph, and involve errors 
related to (i) the ambiguous definition of the object to be measured, (ii) 
spectral features of the speech wave, (iii) intermodulation distortion, 
(iv) the spectrographic record, and (v) the measuring procedure: 

- A formant is seen both as a spectral prominence in the speech
wave and as a filter property of the vocal tract; a definition comprising
both components contradicts itself; a definition embracing
just the first component presupposes that the relevant
information for speech perception is immediately available in the
speech wave; a definition based on the second part alone is
production oriented and sees the true formant value as a vocal
tract pole frequency that is being measured from its (sometimes
poor) reflection in the speech wave.

- The resolution of the spectral envelope depends on the interval
between the partials, which is equal to the fundamental frequency;
a spectral peak may be asymmetrical within the formant
band; individual spectral peaks become less well defined as
they approach each other or as their bandwidths increase. [...]


Lindblom’s advice is thus still valid today. It is still necessary to ap- 
ply one’s knowledge and experience of speech production and ex- 
pected envelope shapes to the problem of how to select samples to 
measure and where to look for spectral peaks.” (Wood, 1989, referring 
to Lindblom, 1962) 


“[...] At this point we should remember that an LPC filter lumps to- 
gether several aspects of speech production [...]. An LPC spectrum 
represents not only the formant frequencies due to the resonances of 
the vocal tract but also the effects of the lip radiation and the spectrum 
of the pulse from the vocal folds. Nevertheless, the peaks in the LPC 
spectrum are usually good indicators of the formant frequencies. Prob- 
lems may arise when two formants are close together, in which case 
the spectrum may appear to have only a single peak corresponding 
to both of them, or when one formant has a lower amplitude, so that 
it appears as only a kink in the curve representing another formant. 
These problems lead us to another way of considering LPC analysis. 
It is also possible to analyze an LPC expression so as to deter- 
mine the exact frequencies corresponding to the poles (which, howev- 
er, may not be exactly those of the formants in the vocal tract transfer 
function). For every pair of LPC terms we get a pair of numbers corre- 
sponding to the frequency and the bandwidth of a pole in the filter. We 
know [...] that there will be a formant at 500Hz, 1,500 Hz, 2,500 Hz, and 
so on in a neutral vowel for a speaker with a vocal tract of 17.5 cm. In 
general, for such a speaker there will be one formant for every 1,000 Hz 
interval. So with a 10,000 Hz sample rate and an upper frequency limit 
of 5,000 Hz, we can expect to find five formants. This will require ten 
LPC terms. If we want to allow two further terms to account for higher 
formants that may be influencing the spectrum or a pole due to the
glottal pulse shape, then we should make a twelve-point LPC analysis. 
If the speaker might have a shorter vocal tract so that we could only ex- 
pect four formants below 10,000 Hz, then we could use a ten point LPC. 
Choosing the right number of coefficients for an LPC analysis 
is somewhat of an art. If one chooses too many, the analysis will pro- 
duce poles corresponding to spurious formants; if one chooses too 
few, formants may be lumped together because the higher formants or 
the glottal pulse may require more complex specification. The problem 
is compounded by the fact that an LPC analysis is equivalent to trying 
to model the spectrum using only poles, and there may be zeros (an- 
tiresonances) in the vocal tract transfer function. There certainly will be 
antiresonances in any vocal tract shape that contains the equivalent of 
a side tube, such as the oral cavity in the case of a nasal sound. LPC 
analysis is not reliable for nasalized vowels. A general rule of thumb 
for the number of coefficients is the sample rate in kHz plus 2, e.g. 
10,000 Hz = 10 kHz plus 2 equals 12. But a better rule is to use several 
different analyses with different numbers of coefficients and see which 
gives the most interpretable results.” (Ladefoged, 1996, pp. 210-212) 
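
The conversion of “a pair of LPC terms” into “the frequency and the bandwidth of a pole” mentioned in this quotation can be sketched as follows. The sketch is not Ladefoged's own procedure. It assumes a single second-order section 1 + a1·z⁻¹ + a2·z⁻² with a complex-conjugate pole pair and uses the standard relations between pole angle and frequency and between pole radius and bandwidth. All function names are hypothetical, and a full analysis would first have to factor the complete LPC polynomial, which is not shown here.

```python
import math


def pole_pair_to_formant(a1, a2, sample_rate_hz):
    """Return (frequency_hz, bandwidth_hz) for the section 1 + a1*z^-1 + a2*z^-2."""
    if a2 <= 0 or a1 * a1 >= 4 * a2:
        raise ValueError("section has real poles and does not describe a resonance")
    radius = math.sqrt(a2)                      # distance of the poles from the origin
    angle = math.acos(-a1 / (2 * radius))       # pole angle in radians
    frequency = angle * sample_rate_hz / (2 * math.pi)
    bandwidth = -math.log(radius) * sample_rate_hz / math.pi
    return frequency, bandwidth


def formant_to_pole_pair(frequency_hz, bandwidth_hz, sample_rate_hz):
    """Inverse mapping, used here only to build a test section."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    angle = 2 * math.pi * frequency_hz / sample_rate_hz
    return -2 * radius * math.cos(angle), radius * radius


def rule_of_thumb_order(sample_rate_hz):
    """Number of LPC coefficients: sample rate in kHz plus 2."""
    return round(sample_rate_hz / 1000) + 2


a1, a2 = formant_to_pole_pair(500.0, 80.0, 10000)
print(pole_pair_to_formant(a1, a2, 10000))
print(rule_of_thumb_order(10000))  # 12, as in the quotation
```

As the quotation stresses, the rule of thumb is only a starting point; comparing analyses with several different numbers of coefficients remains the more reliable approach.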


“Good spectrograms are a great help in determining where the formants 
are. This is often not as easy as one might imagine. You have to know
where to look for formants before you can find them. The best practical 
technique is to look for one formant for every 1,000Hz. The vowel e, 
for example, has formants at about 500, 1,500 and 2,500 Hz for a male 
speaker (all slightly higher for a female speaker). Other vowels will have 
formants up or down from this mid range. But there are exceptions to 
this general rule of one formant per 1,000 Hz. It would be more true to 
say that there is, on average, one formant for every 1,000 Hz. Low back 
vowels may have two formants below 1,000Hz, but nothing between 
1,000 and 2,000 Hz, and then the third formant somewhere between 
2,000 and 3,000 Hz.” (Ladefoged, 2003, pp. 113-114) 
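
The figure of one formant per 1,000 Hz for a vocal tract of 17.5 cm, quoted above, corresponds to the odd quarter-wavelength resonances of a uniform tube closed at one end. The following sketch assumes this idealised tube model and a speed of sound of 350 m/s (35,000 cm/s); both are assumptions made for illustration, and the function name is hypothetical.

```python
def neutral_tube_formants(tract_length_cm, max_hz, speed_of_sound_cm_s=35000.0):
    """Resonances (2n - 1) * c / (4 * L) of a uniform tube closed at one end."""
    formants = []
    n = 1
    while True:
        frequency = (2 * n - 1) * speed_of_sound_cm_s / (4 * tract_length_cm)
        if frequency > max_hz:
            break
        formants.append(frequency)
        n += 1
    return formants


print(neutral_tube_formants(17.5, 5000))
# [500.0, 1500.0, 2500.0, 3500.0, 4500.0]
```

Under these assumptions, five resonances fall below 5,000 Hz, which matches the five formants and ten LPC terms mentioned by Ladefoged. A shorter tube moves all resonances upwards, so that fewer of them fall within the same frequency range.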


Methods of formant estimation II: methodological limits related to F0 


“[...] in the case of female speech, formant analysis is extremely dif- 
ficult. The fundamental frequency is so high that formants are often 
poorly defined. [...] We had difficulties in determining the position of a 
formant in about 40% of the 300 vowel segments, if no a priori knowl- 
edge was used.” (Van Nierop et al., 1973) 


“[...] because formant frequencies are hard to determine when funda- 
mental frequency is higher than about half of the frequency of the first 
formant.” (Sundberg, 1987, pp. 124-125) 


“Accurate measurement of formant frequencies is important in many
studies of speech perception and production. Errors in formant frequency
estimation by eye, using a spectrogram, or automatically, using
linear prediction, have been reported to be as high as 60 Hz at
F0 < 300 Hz. This exceeds the typical auditory difference limens (DLs)
for formant frequencies and is also greater than some of the variation
that one would like to study, e.g. the acoustic effects of varying vocal
effort. The problem becomes substantially worse when F0 is as high
as 500 to 600 Hz, which is not uncommon in the speech of women and
children at high vocal efforts.” (Traunmüller & Eriksson, 1997)


“Measurements of the frequency position of the formants, considered 
as the resonances of the vocal tract, are affected by substantial errors 
when F0 is as high as it is when people communicate over large distances.
This holds for LPC-based methods as well as when using visual
inspection of spectrograms.” (Traunmüller & Eriksson, 2000)


“The problem is that it is difficult to determine reliably the resonance 
frequencies of the tract from the sound alone, using either spectral 
analysis or linear prediction, once F0 exceeds 350 Hz (Monson and
Engebretson, 1983), and essentially impossible once F0 exceeds
500 Hz.” (Joliveau et al., 2004) 


“[...] it is difficult to determine unambiguously the frequencies of the 
resonances with a resolution much finer than f0/2.” (Swerdlin, Smith, 
& Wolfe, 2010) 
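
The limit of about f0/2 named in the last quotation can be made concrete with a small sketch. It assumes the strong idealisation that the spectrum provides information only at the harmonics of F0, so that a resonance can at best be located at the nearest harmonic. The function name and the example values are hypothetical.

```python
def nearest_harmonic(resonance_hz, f0_hz):
    """Return (harmonic number, harmonic frequency, distance to the resonance)."""
    number = max(1, round(resonance_hz / f0_hz))  # the first harmonic is the lowest one
    frequency = number * f0_hz
    return number, frequency, abs(frequency - resonance_hz)


# A resonance at 700 Hz, probed by harmonic series of increasing F0.
for f0 in (100, 300, 500):
    print(f0, nearest_harmonic(700, f0))
```

As long as the resonance lies above f0/2, the distance to the nearest harmonic cannot exceed f0/2. With F0 = 500 Hz, this distance amounts to 200 Hz in the example, far more than the error of 60 Hz reported above for F0 below 300 Hz.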


Methods of formant estimation III: “One wonders, for example, 

if the source-filter theory of speech production would have taken 
the same course of development if female voices had been the 
primary model early on.” 


“To a large extent, the early work in acoustic phonetics focused on 
the adult male speaker. There were a number of reasons for this fo- 
cus, including social and technical factors. Only rather recently has the 
study of acoustic phonetics been broadened to encompass significant 
research on populations other than men. This is not to say that children 
and women were neglected altogether in the early history of acoustic 
speech research. Peterson and Barney’s (1952) classic study included 
acoustic data on vowels for men, women and children, making it clear 
that acoustic values vary markedly with age and gender characteristics 
of speakers [...]. 

The problem is that the research effort given to the speech 
of women and children has been on a smaller scale than that given 
to the speech of men. Consequently, there is a continuing need to 
gather acoustic data for diverse populations. The concentration on
male speakers had several consequences, not all of which facilitat- 
ed research on the speech of women and children. One consequence 
was the choice of an analyzing bandwidth (300 Hz for the ‘wide-band’ 
analysis) on early spectrographs that worked well enough for most 
adult male voices but was deficient for many women and children. The 
unsuitability of the analyzing bandwidth probably discouraged acous- 
tic analyses of women’s and children’s speech. 

The implications of the male emphasis may have reached even 
to theory; Titze (1989, p. 1699) commented, ‘One wonders, for exam- 
ple, if the source-filter theory of speech production would have taken 
the same course of development if female voices had been the primary 
model early on.’ Klatt and Klatt (1990, p. 820) remarked on the same 
point: ‘informal observations hint at the possibility that vowel spectra 
obtained from women’s voices do not conform as well to an all-pole 
[i.e. all formant] model, due perhaps to tracheal coupling and source/ 
tract interactions.’ The acoustic theory for vowels [...] assumed that the 
vocal tract transfer function is satisfactorily represented by formants 
(poles) and that antiformants (zeros) are required only for modifications 
such as nasalization. It is advisable to bear in mind that this theory is 
predicated largely on the characteristics of adult male speech and that 
it may have to be altered to account for the characteristics of both 
children and women.” (Kent & Read, 2002, pp. 189-190)


Materials Part III


The third part of the Materials section presents exemplary series 
of vowel sounds and related acoustic analyses linked to the third 
part of the main text, including further indications on previously 
published data. 


Note on the Method


Empirical basis 


As mentioned in the introduction, the empirical basis of this treatise — 
and the basis of the series of vowel sounds selected for presentation 
here—consists of recordings from various areas of everyday life, the 
entertainment sector and art, that is, stage voices in music and straight 
theatre. (For an additional investigation of sounds of birds imitating 
human utterances, see Section M10.A.) 


The recordings were collected over a time period of more than 20 years 
with different techniques related to different sound qualities, and they 
represent utterances of speakers different in age and gender, produc- 
ing vowel sounds in different contexts, with different durations and dif- 
ferent vocal efforts. However, such variation is not a shortcoming but 
an intention here, since this treatise focuses on the psychophysical 
question of the vowel (see the introduction and Section 13.7): given 
that different vowel sounds are perceived as being related to a single 
vowel quality —in contrast to the variation of other vocal sound charac- 
teristics—, which describable physical characteristic or which ensem- 
ble of physical characteristics may be said to represent that quality? 


Concerning the acoustic characteristics of vowel sounds, the sound 
examples presented here were produced in isolation or in word context 
by native German or Swiss-German speakers, with a few exceptions, 
and the vowel qualities correspond to Standard German. Because of 
the psychophysical perspective adopted here, and because of the 
large fundamental frequency range considered —including many high- 
pitched vowel sounds produced in isolation or in the context of high- 
pitched speech by untrained children, women and men as well as by 
professional actresses and actors—, no principal difference is made 
between speaking and singing for isolated vowel sounds or extracted 
vowel nuclei and no corresponding indication is given in the figures 
which would relate to a classificatory system of modes of vowel pro- 
duction. — Acoustic analysis as well as perceptual identification relates 
to sounds produced in isolation or extracted as vowel nuclei from words. 


Concerning the acoustic characteristics of pitch contours, the examples
presented here (see Section 8.2) only concern contours of speech.
Thereby, they relate to utterances of speakers of different languages 
(see the corresponding figure legends). 


Whereas one part of these recordings forms the basis of single, published
investigations undertaken in the past, which included listening
tests, another part is unpublished and the corresponding recordings 
have not been subject to any further identification tests, apart from the 
identification by the author: in the course of creating this publication, 
for each of the sound series of a single figure presented in the Mate- 
rials section, the author has evaluated the perceptual vowel quality of 
each sound separately. Moreover, only sounds are presented for which 
the intended and the perceived vowel quality correspond. 


Acoustic analysis 


With regard to the acoustic analysis of the sounds in general and to 
the calculation of fundamental and formant frequencies in particular, 
automatically calculated values using routines from the PRAAT Soft- 
ware (Boersma & Weenink, 2015) related to corresponding standard 
parameters are given in the figures of Chapters 7 to 10. 


Acoustic analysis was conducted on isolated vowel sounds or on extracted
vowel nuclei and concerned F0, spectrum, formant frequencies
and LPC curve. (Note that the digital version of the Materials further 
includes pitch contour, spectrogram, formant tracks and comparison 
of three formant patterns and three LPC curves related to the three 
standard parameter settings for children, women and men.) 


For longer vowel sounds, a middle sound fragment of 0.3 s, and for 
shorter sounds, a middle vowel nucleus excluding onset and offset 
was analysed. 
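
The selection of the analysed portion of a sound can be sketched as follows. This is an illustration only and not the routine actually used for the Materials: the function name is hypothetical, and the share of a shorter sound that is discarded as onset and offset (`edge_fraction`) is an assumed value, since the text does not specify it.

```python
def middle_fragment(n_samples, sample_rate_hz, fragment_s=0.3, edge_fraction=0.1):
    """Return (start, end) sample indices of the portion to analyse."""
    fragment_len = round(fragment_s * sample_rate_hz)
    if n_samples >= fragment_len:
        # Longer sound: centred fragment of fragment_s seconds.
        start = (n_samples - fragment_len) // 2
        return start, start + fragment_len
    # Shorter sound: drop an assumed share at onset and offset.
    edge = round(edge_fraction * n_samples)
    return edge, n_samples - edge


print(middle_fragment(44100, 44100))  # 1 s of sound sampled at 44.1 kHz
print(middle_fragment(4410, 44100))   # 0.1 s of sound sampled at 44.1 kHz
```

The returned indices can be used to slice the sample array before fundamental frequency, spectrum and formant frequencies are calculated.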


The fundamental frequency of a sound fragment was calculated as 
average value using the Praat command To Pitch. Calculated values 
were perceptually crosschecked. If calculation errors occurred, the pa- 
rameters “pitch floor” and “pitch ceiling” were adjusted. 


The spectrum of a sound fragment was calculated as average spectrum
for 0-5.5 kHz.


The formant frequencies of a sound fragment were automatically calcu- 
lated as average values of LPC analysis using the Praat command To 
Formant (robust), with standard parameters according to the age and/ 
or gender of the speaker and for a frequency range of 0-5.5 kHz. For illustration
purposes, an LPC curve was calculated related to the analysis
window in the middle of the sound fragment analysed. 


Please note:


- Spectrum and numerical formant frequencies are calculated as
averaged for the entire sound fragment analysed, but the LPC 
curve is related to a single window in the middle of the fragment. 
As a consequence, for a few sounds, the LPC filter curve does 
not correspond to the vowel spectrum and the numerical formant 
pattern. 

- Because of automatic calculation and averaged values, calculated
F1 for sounds of /i, y, u/ at middle and high fundamental frequencies
is sometimes given as slightly below F0. In these cases,
F1 can be estimated as roughly matching F0.

- A few of the calculated frequencies of the formants considered
deviate so strongly from the sound spectrum and its amplitude
minima and maxima that they are set in parentheses or have been
replaced by a rough estimation related to the spectrum. Exceptions
are the sounds produced by birds, for which the automatically
calculated formant frequencies are given without consideration
of their validity.


For longer recordings of speech (see Section M8.2), only the pitch con- 
tour was analysed and perceptually crosschecked. If major calcula- 
tion errors occurred, the parameters “pitch floor” and “pitch ceiling” 
were again adjusted. 


Illustrations 


Each figure includes a series of vowel sounds (represented as vowel 
spectra) or examples of speech (represented as pitch contours). The 
subject matter of illustration is explained in the text and indicated in 
short form in the figure legend. 


A vowel spectrum is given as the sound pressure level (SPL) in dB/ 
Hz (y-coordinate) for a frequency range of 0-5500 Hz (x-coordinate). 
If, in the text, a vowel spectrum is considered in relation to calculated 
formants and/or to an LPC curve, this curve is also shown; if not, only 
the spectrum is presented. Below a spectrum, the following indications 
are given in the first line: figure number and number of the spectrum 
in the figure, vowel quality, fundamental frequency (F0), identification 
number of the speaker, gender of the speaker (w=woman/female, 
m=man/male), age group of the speaker (C=children, A=adults; note 
B=birds) and record number (R) of the recording in the database. For 
some figures, depending on the context of consideration, selected 
formant frequencies are indicated in addition in the second line. 


Since the single vowel spectra relate to single vowel sounds, the vowel
quality is given in square brackets. Note that in the figures, the vowel
quality of /a-a/ is represented by the character “a” with no further
differentiation.


Pitch contours of speech are given as the pitch frequency in Hz (y-co- 
ordinate) over a time range in s (x-coordinate). Below a pitch contour, 
the following indications are given in the first line: figure number and 
number of the contour in the figure, [speech] as the mode of vocal 
expression and the content of recording, identification number of the 
speaker, gender and age group of the speaker and record number (R) 
of the recording in the database. In the second line, the overall F0
range for all contours of a speaker presented in a figure is given. 


Note that the order of sound presentation in relation to vowel qualities 
and to F0 is not uniform throughout the entire Materials section; for
each single section, this order accords to the subject matter illustrated 
and to the choice of the author. 


Digital version of the Materials 


More details on the method and, as mentioned, an extended docu- 
mentation of the results of acoustic analysis are provided in the digital 
version of the Materials at: 

http://www.phones-and-phonemes.org/vowels/acoustics/preliminaries


M7 Unsystematic Correspondence between
Vowels, Patterns of Relative Spectral 
Energy Maxima and Formant Patterns 


M7.1 Inconstant Number of Vowel-Specific Relative Spectral 
Energy Maxima and Incongruence of Vowel-Specific 
Formant Patterns 


Figures 1 to 3 show examples of sounds of the back vowels /u, o/ and 
of /a-a/ exhibiting only one relative spectral energy maximum within 
their vowel-specific frequency range <c. 1.5 kHz. Each series corresponds
to sounds produced by speakers of one speaker group (children,
women, men). Note that for the sounds of /a-a/, a dominant
first harmonic is ignored here when interpreting relative spectral energy 
maxima. Note also that the examples 1, 3 and 4 in Figure 1 perceptu- 
ally represent /o/ rather than /a-a/. 


For each of the speaker groups and each of the three vowels in question, 
Figures 4 to 6 show three examples exhibiting two relative spectral en- 
ergy maxima within their vowel-specific frequency range <c. 1.5 kHz, as 
is usually assumed to be the “normal” case for sounds of these vowels. 


Note that the spectra of the sounds of /u, o/ shown in Figures 1 to 3 
cannot be interpreted as a general manifestation of “formant merging”: 
if these spectra are compared with the spectra of the corresponding 
vowel sounds shown in Figures 4 to 6, the lowest spectral envelope 
peaks occur at similar frequency levels, given similar F0. Thus, the first
spectral envelope peak of all sounds corresponds to the vowel quality 
in question, whereas the second spectral envelope peak for the sounds 
shown in Figures 4 to 6 may be related to an additional sound “colour- 
ing” that, however, does not possess vowel-differentiating value. Figure 
7 illustrates this phenomenon by direct comparison of selected sounds 
of /u, o/ in Figures 1 to 3 with selected sounds of /u, o/ in Figures 4 to 6.


Figures 8 and 9 show examples of sound pairs of the vowels /i/ and 
/e/, each pair produced by speakers of one speaker group, for which 
differences in F0 and F1 are small but differences in the higher vowel-related
spectral parts are substantial, up to F2 of the second sound
matching or exceeding F3 of the first. Figure 10 shows more sound 
pairs of this kind but, in this case, comparing sounds of children and 
men, in order to document the phenomenon in its very extreme. 


For earlier accounts, see Maurer, Landis, and d’Heureuse (1991), Mau- 
rer and Landis (1995). 


Figure 1. Sounds of /a-a, o, u/, produced by children, which exhibit only one relative
spectral energy maximum within their vowel-specific frequency range <c. 1.5 kHz.

(Figure 1, continuation)

[Spectra not reproduced here. x-axis: Frequency (Hz), 0-5000; y-axis: Sound Pressure Level (dB/Hz). Legends of the spectra:]

1-13 [u] F0=217Hz 62-m-C R23656
1-14 [u] F0=311Hz 61-m-C R23519
1-15 [u] F0=344Hz 88-m-C R28257
1-16 [u] F0=424Hz 135-m-C R7079
1-17 [u] F0=507Hz 42-m-C R19066
1-18 [u] F0=594Hz 69-m-C R24802
1-19 [u] F0=736Hz 61-m-C R23622
1-20 [u] F0=834Hz 38-w-C R18452


Figure 2. Sounds of /a-a, o, u/, produced by women, which exhibit only one relative
spectral energy maximum within their vowel-specific frequency range <c. 1.5 kHz.

(Figure 2, continuation)

[Spectra not reproduced here. x-axis: Frequency (Hz), 0-5000; y-axis: SPL (dB/Hz). Legends of the spectra:]

2-13 [o] F0=298Hz 376-w-A R48255
2-14 [o] F0=300Hz 180-w-A R39059
2-16 [u] F0=215Hz 73-w-A R3457
2-17 [u] F0=237Hz 180-w-A R39090
2-18 [u] F0=280Hz 1-w-A R7003
2-19 [u] F0=311Hz 1-w-A R10045
2-20 [u] F0=392Hz 24-w-A R15036
2-21 [u] F0=507Hz 6-w-A R10807
2-22 [u] F0=606Hz 33-w-A R17229
2-23 [u] F0=721Hz 14-w-A R12887
2-24 [u] F0=863Hz 53-w-A R21540


Figure 3. Sounds of /a-a, o, u/, produced by men, which exhibit only one relative spectral
energy maximum within their vowel-specific frequency range <c. 1.5 kHz.

(Figure 3, continuation)

Figure 4. Sounds of /a-a, o, u/, produced by children, which exhibit two relative spectral
energy maxima within their vowel-specific frequency range <c. 1.5 kHz.

Figure 5. Sounds of /a-a, o, u/, produced by women, which exhibit two relative spectral
energy maxima within their vowel-specific frequency range <c. 1.5 kHz.

Figure 6. Sounds of /a-a, o, u/, produced by men, which exhibit two relative spectral
energy maxima within their vowel-specific frequency range <c. 1.5 kHz.


Figure 7. Direct comparisons of sounds of back vowels with one or two relative spectral
energy maxima <c. 1.5 kHz. (Sounds of children are selected from Figures 1 and 4, those
for women from Figures 2 and 5 and those for men from Figures 3 and 6.)

(Figure 7, continuation)

Figure 8. Sound pairs of /i/, each pair produced by speakers of one and the same age
and gender-related speaker group, with small differences in F0 and F1 but substantial
differences in the higher vowel-related spectral range.

Figure 9. Sound pairs of /e/, each pair produced by speakers of one and the same age
and gender-related speaker group, with small differences in F0 and F1 but substantial
differences in the higher vowel-related spectral range.

Figure 10. A sound pair of /i/ and a corresponding pair of /e/, each pair comparing productions
of a man and a child, with small differences in F0 and F1 but very pronounced
differences in the higher vowel-related spectral ranges.


M7.2 Partial Lack of Manifestation of Vowel-Specific Relative
Spectral Energy Maxima 


Figures 11 and 12 show examples of sounds of the vowels /a-a/ and 
of /o/ with “flat” or “sloping” spectral portions in their vowel-specific 
frequency range <c. 1.5 kHz which are lacking a clearly determinable
peak. Note that the perceived vowel quality of some sounds intentionally
produced as /a-a/ lies in between /a/ and /ɔ/, and of some
sounds intentionally produced as /o/ in between /o/ and /ɔ/. Note also
that for the sounds of /a-a/, a dominant first harmonic is again ignored 
here when interpreting relative spectral energy maxima. (For cases of 
“sloping” lower spectral portions in sounds of /u/, see Section M7.1, 
Figures 1 to 3.) 


Figures 13 and 14 show corresponding observations for sounds of 
the front vowels /i, e/ with “flat” higher spectral portions in their upper
vowel-specific frequency range of 1.5-5 kHz which are lacking a clearly 
determinable pattern of vowel-related peaks. 
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Figure 11. Sounds of /a-a/, produced by children, women and men, which exhibit “flat” 
or “sloping” lower spectral portions < c. 1.5 kHz lacking a clearly determinable
vowel-related peak.
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(Figure 11, continuation) 
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(Figure 11, continuation) 
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Figure 12. Sounds of /o/, produced by children, women and men, which exhibit “flat” or 
“sloping” lower spectral portions < c. 1.5 kHz lacking a clearly determinable
vowel-related peak.
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(Figure 12, continuation) 
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Figure 13. Sounds of /i/, produced by children, women and men, which exhibit “flat” 
higher spectral portions in the frequency range of 1.5-5 kHz lacking a clearly
determinable pattern of vowel-related peaks.
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(Figure 13, continuation) 


Axes: Frequency (Hz), 0-5000; SPL (dB/Hz). Panels:
13-13 [i] F0=395Hz 86-m-C R28093
13-14 [i] F0=396Hz 31-w-A R16658
13-15 [i] F0=398Hz 34-w-A R17391
13-16 [i] F0=400Hz 49-w-A R20361
13-17 [i] F0=402Hz 34-w-A R17390
13-18 [i] F0=402Hz 57-w-A R22338
13-19 [i] F0=403Hz 1-w-A R10135
13-20 [i] F0=461Hz 266-m-C R44766
13-21 [i] F0=487Hz 266-m-C R44804
13-22 [i] F0=492Hz 94-m-C R29325
13-23 [i] F0=496Hz 64-w-C R23980
13-24 [i] F0=497Hz 31-w-A R16659
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Figure 14. Sounds of /e/, produced by children, women and men, which exhibit “flat” 
higher spectral portions in the frequency range of 1.5-5 kHz lacking a clearly
determinable pattern of vowel-related peaks.
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(Figure 14, continuation) 
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M8 Lack of Correspondence between Vowels 
and Patterns of Relative Spectral Energy 
Maxima or Formant Patterns 


M8.1 Dependence of Vowel-Specific, Relative Spectral Energy 
Maxima and Lower Formants <1.5kHz on Fundamental 
Frequency 


Figure 1 shows examples of sounds of the vowels /o, ø, e/ produced
at different F0 by a woman (/o/), a man (/ø/) and a child (/e/; age 8).
In the frequency range of F0 of c. 200-400 Hz, the second partial is
generally dominant, thus indicating a shift of the lowest spectral peak
with rising F0, which is also indicated by the corresponding calculated
F1. In more detail: For the sound series of the vowel /o/, the shift in F0
is 170-400 Hz, the frequency shift of the dominant second harmonic
is 340-800 Hz and the shift of calculated F1 is c. 380-800 Hz. (Note
that for the sound at F0 = 400 Hz, the first calculated formant value at
560 Hz is ignored here because it is associated with a bandwidth of
928 Hz and, as a consequence, the LPC filter curve does not show a
corresponding peak.) For the sound series of the vowel /ø/, the shift
in F0 is c. 110-360 Hz, the frequency shift of the dominant harmonic
(third harmonic up to F0 = 167 Hz, then second harmonic) is c. 330-720 Hz
and the shift of calculated F1 is c. 350-710 Hz. For the sound
series of the vowel /e/, the shift in F0 is c. 210-360 Hz, the frequency
shift of the dominant second harmonic is c. 420-720 Hz (dominance
is weak but constant) and the shift of calculated F1 is c. 420-720 Hz.
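
The harmonic frequencies quoted for Figure 1 follow from the relation that the n-th harmonic lies at n times F0. The following is a minimal sketch for checking the quoted ranges; the function name and its parameters are illustrative and do not stem from the study itself.

```python
def harmonic_range(f0_low_hz, f0_high_hz, n_low, n_high=None):
    """Return (low, high) frequency in Hz of the dominant harmonic.

    n_low is the number of the dominant harmonic at the lowest F0,
    n_high the number at the highest F0. They differ when dominance
    passes from one harmonic to another within a sound series.
    """
    if n_high is None:
        n_high = n_low
    return (n_low * f0_low_hz, n_high * f0_high_hz)


# F0 ranges and dominant harmonic numbers as quoted in the text for Figure 1.
print(harmonic_range(170, 400, 2))     # second harmonic throughout
print(harmonic_range(110, 360, 3, 2))  # third harmonic first, then second
print(harmonic_range(210, 360, 2))     # second harmonic throughout
```

The three calls reproduce the ranges 340-800 Hz, 330-720 Hz and 420-720 Hz given above.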


Figure 2 shows examples of sounds of the vowels /u, y, i/ produced at
different F0 by a woman (/u/), a child (/y/; age 13, transition to
adolescence) and a woman (/i/). For all sounds, the first partial is
generally dominant, thus indicating a shift of the lowest spectral peak
with rising F0, which is also indicated by the corresponding calculated
F1. (Note that for higher levels of F0, the calculation of F1 is
methodically unsubstantiated; however, the calculated values correspond
to the dominant first harmonics.) In more detail: For the sound series
of the vowel /u/, the shift in F0 is c. 220-870 Hz, as is true for the
frequency shift of the dominant first harmonic, and the shift of
calculated F1 is c. 230-870 Hz. For the sound series of the vowel /y/,
the shift in F0 is c. 210-710 Hz, as is true for the frequency shift of
the dominant first harmonic, and the shift of calculated F1 is c. 380-740 Hz.
(Note the problem of automatic calculation of F1 for the example in
Figure 2-14.) For the sound series of the vowel /i/, the shift in F0 is
c. 210-830 Hz, as is true for the frequency shift of the dominant first
harmonic, and the shift of calculated F1 is c. 240-900 Hz.
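
The caveat on calculating F1 at higher levels of F0 can be illustrated by counting how many harmonics fall below 1.5 kHz. The sketch below only lists the harmonics n times F0 below that limit. Reading their sparseness as the reason for the caveat is an assumption made here, not a statement of the text, and the function name and the F0 values chosen are illustrative.

```python
def harmonics_below(f0_hz, limit_hz=1500.0):
    """List the harmonic frequencies n * F0 (in Hz) that lie below limit_hz."""
    freqs = []
    n = 1
    while n * f0_hz < limit_hz:
        freqs.append(n * f0_hz)
        n += 1
    return freqs


# Illustrative F0 levels; 870 Hz is the upper end of the /u/ series above.
for f0 in (220, 440, 870):
    print(f0, harmonics_below(f0))
```

At F0 = 870 Hz, a single harmonic remains below 1.5 kHz, so any estimate of a lower spectral peak rests on that harmonic alone.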


Note the very pronounced spectral differences for the three sounds of
/i, y, u/ in the frequency range of F0 of 700-800 Hz, which reinforce the
thesis of a parallelism between differences in perceived vowel quality
and related acoustic differences, that is, the thesis of vowel-specific
harmonic spectra of high-pitched sounds.


However, as mentioned in Section 8.1, indications for an F0-dependence
of the lower spectral peaks and lower formants < 1.5 kHz are not
systematic: above all, the indications in question relate to frequency
ranges of F0, to vowel qualities and to single speakers and their
phonation characteristics, including vocal effort.


Concerning the F0 ranges, the indications for the F0-dependence in
question are generally weak or absent for F0 < c. 200 Hz for the sounds
of all vowels (see, for example, Figure 1 in this chapter, the
corresponding sounds of /ø/).


Concerning vowel quality, the indications of the F0-dependence in
question are particularly evident in the sounds of /i, y, e, ø, o, u/ but
often unsystematic, weak or even absent for the sounds of /ɛ/ and of
/a-a/. By way of illustration, Figure 3 shows examples of sounds of
/a-a/ produced by a child (age 13, transition to adolescence) at
different F0. The harmonic spectrum varies strongly, and peak and formant
estimation is difficult to conduct. However, no clear indication of a
relation between F0 and the lower spectral envelope is evident.


Concerning single speakers and their phonation characteristics,
including vocal effort, Figure 4 shows examples of sounds of /o/
produced at different F0 by a woman; in contrast to the corresponding
sound series in Figure 1, only a very weak indication of a relation
between F0 and the lower spectrum is evident.


But, as mentioned in Section 8.1, although the indications for the
dependence discussed here prove to be unsystematic, the findings of
intelligible vowel sounds at fundamental frequencies > 500 Hz (see next
chapter) and of formant pattern ambiguity (see Chapter M9) force us to
relate the lower spectral peaks and the lower formants to fundamental
frequency.


In addition, such a dependence can also be observed for the second 
formant for cases of sounds of back vowels (see, for example, Section 
10.1, Figure 1). 




In the context of such F1 shifts with rising F0, “inverted” frequency
levels of the lowest spectral peak and of calculated F1 can be observed
for two sounds of two different vowels: where statistical values give
a lower F1 for one vowel quality than for the other, higher values can
be found for sounds of the former than for sounds of the latter if F0
variations are included in the investigation. Figure 5 shows examples
of such cases in terms of sound pairs of /o, u/ and /e, i/. (The sound
pairs produced by children, women and men are presented separately.)
The lowest spectral peaks < 1.5 kHz for the sounds of /u/ are above
those of the sounds of /o/, as is the case for the sounds of /i/
compared with the sounds of /e/. Moreover, no clear indication of a
second peak < 1.5 kHz and a corresponding marked F2 is manifest for the
sounds of /o, u/, and the calculated F2 values for the sound pairs of
/e, i/ are also “inverted”, i.e. F2 for the sounds of /i/ is found
below F2 for the sounds of /e/.


This observation foreshadows formant pattern ambiguity of vowel sounds, 
as documented in detail in Chapter M9. 


For earlier accounts, see Maurer, Landis, and d’Heureuse (1991), Maurer
and Landis (1995, 1996, 2000); see also Traunmüller (n.d.) for
synthesised examples.
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Sound Pressure Level (dB/Hz) 


Figure 1. Sounds of /o, ø, e/ produced at different F0 by a woman (/o/), a man (/ø/) and
a child (/e/) indicating a shift of the lowest spectral peak as well as of calculated F1 with
rising F0.
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(Figure 1, continuation) 
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Figure 2. Sounds of /u, y, i/ produced at different FO by a woman (/u/), a child (/y/) and 
another woman (/i/) indicating a shift of the lowest spectral peak as well as of calculated
F1 with rising F0.
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(Figure 2, continuation) 
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(Figure 2, continuation) 
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Figure 3. Sounds of/a-a/, produced at different FO by a child, for which there is no clear 
indication of a relation between F0 and the lower spectral envelope (even if the harmonic
spectrum strongly varies).
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Figure 4. Sounds of/o/, produced at different FO by a woman, for which only a very weak 
indication of a relation between F0 and the lower spectrum is manifest.
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and Lower Formants< 1.5kHz on Fundamental Frequency 


Figure 5. Three sound pairs of /o, u/ and three sound pairs of /e, i/, produced by chil- 
dren, women and men, exhibiting a higher first spectral peak frequency for /u/ than for 
/o/, and for /i/ than for /e/, respectively. Note also the absent second spectral peak 
<1.5kHz for the sounds of the back vowels and higher calculated F2 for /e/ than for /i/. 
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(Figure 5, continuation) 
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and Lower Formants< 1.5kHz on Fundamental Frequency 


M8.2 Vowel Perception at Fundamental Frequencies above 
Statistical Values of the Respective First Formant Frequency 


Figure 6 shows intelligible high-pitched sounds of the vowels /y, e, ø,
ɛ, o/ at F0 of c. 750 Hz, and Figure 7 exhibits intelligible high-pitched
sounds of the corner vowels /i, a, u/ at F0 of c. 850 Hz. Note again the
pronounced spectral differences for these high-pitched sounds of
different vowels, supporting the thesis of a parallelism between
differences in perceived vowel quality and related acoustic differences,
that is, the thesis of vowel-specific harmonic spectra.


Figures 8 to 10 show examples of speech extracts of untrained speakers,
journalists, TV hosts, actresses and actors, which manifest pitch
contours for utterances of single speakers exceeding the age- and
gender-related statistical F1 of the vowels /i, y, u/ (450 Hz for
children, 400 Hz for women and 350 Hz for men). The ranges of F0
indicated (overall ranges for the speech sounds of a single speaker or
a group of speakers, see below) were determined acoustically, in terms
of approximations, by listening to the sounds. (Please ignore some
errors in the graphics exceeding the verified ranges given below. These
errors are due, for example, to background noise or music, to the sound
of an audience or to automatic pitch calculation.) The order of
presentation within a figure follows, firstly, the number of examples
per speaker or group of speakers and, secondly, the identification
number of the speaker.


Figure 8 shows pitch contours of speech extracts produced by untrained
speakers, journalists, TV hosts and actresses talking on TV (not
acting), as can be experienced in everyday life:


— The examples for speaker 172 (see pitch contours 8-1 to 8-3) 
relate to extracts of a woman selling grilled chicken in a market
in Paris. Overall range of F0 = c. 220-700 Hz (excluding very
high-pitched exclamations).

— The examples for the two speakers subsumed under the ID 
number 379 and for the speaker 380 (see pitch contours 8-4 to
8-6) relate to extracts of two American women and one American
man demonstrating infant/child-directed speech. Overall
range of F0 = c. 200-800 Hz for the women (except one higher
peak at c. 1 kHz) and c. 150-600 Hz for the man.

— The examples for speaker 336 (see pitch contours 8-7 and 8-8, 
the latter from 0.7 to 2.5 sec.) relate to extracts of a female
Indonesian singer talking in a TV show and to an exclamation of her
name during the show. Overall range of F0 = c. 350-950 Hz.
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The two examples for the speakers subsumed under the ID num- 
ber 348 (see pitch contours 8-9 and 8-10) relate to extracts of 
two female TV hosts announcing the results of a singing contest
(announcements in English). Overall range of F0 = c. 200-700 Hz.

The example for speaker 135 (see pitch contour 8-11) relates to
two sentences of a boy (age 6). Range of F0 = c. 220-600 Hz.

The example for speaker 174 (see pitch contour 8-12) relates
to an extract of a female North American journalist speaking on
television. Range of F0 = c. 175-600 Hz.

The example for speaker 217 (see pitch contour 8-13) relates to
an extract of a North American woman talking about her child on
television. Range of F0 = c. 160-550 Hz.

The example for speaker 220 (see pitch contour 8-14) relates to
an extract of a female French doctor talking on television. Range
of F0 = c. 250-520 Hz.

The example for speaker 238 (see pitch contour 8-15) relates
to an extract of a male French TV host. Range of F0 = c. 130-420 Hz
(exceeding only gender-related statistical F1 of the vowels
/i, y, u/).

The example for speaker 383 (see pitch contour 8-16) relates to
an extract of a French woman talking on television in a TV spot.
Range of F0 = c. 220-830 Hz.

The example for two speakers subsumed under the ID number
379 (see pitch contour 8-17) relates to an extract of a female
French journalist (first part) questioning a French woman on the
street, and the answer of the latter (second part). Overall range
of F0 for the utterances of both women = c. 230-600 Hz.
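

The comparison underlying this list, the upper end of a speaker's F0 range set against the statistical F1 of /i, y, u/ for the speaker group, can be sketched as follows. The F1 values and the F0 ranges are copied from the text above for a selection of speakers; the names of the function and the data structures are illustrative assumptions.

```python
# Statistical F1 of /i, y, u/ in Hz per speaker group, as quoted in the text.
STATISTICAL_F1_HZ = {"child": 450, "woman": 400, "man": 350}

# (speaker ID, group, lowest F0, highest F0) in Hz, copied from the
# list for Figure 8; only a selection of the speakers is included.
SPEAKERS = [
    (172, "woman", 220, 700),
    (336, "woman", 350, 950),
    (135, "child", 220, 600),
    (174, "woman", 175, 600),
    (238, "man", 130, 420),
]


def excess_over_f1(group, f0_high_hz):
    """Return by how many Hz the top of an F0 range exceeds statistical F1."""
    return f0_high_hz - STATISTICAL_F1_HZ[group]


for speaker_id, group, f0_low, f0_high in SPEAKERS:
    print(speaker_id, group, excess_over_f1(group, f0_high))
```

A positive value means that the speaker's pitch contour reaches above the statistical F1 for that speaker group; for speaker 238, the margin is the smallest of the selection.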


Figure 9 shows pitch contours of speech extracts of performing ac- 
tresses (film, comic, voice-over, dubbing): 


The examples for speaker 216 (see pitch contours 9-1 to 9-6)
relate to extracts of a female Swiss narrator of fairy tales.
Overall range of F0 = c. 150-900 Hz.

The examples for speaker 177 (see pitch contours 9-7 to 9-9)
relate to extracts of a French comic actress performing on stage.
Overall range of F0 = c. 180-780 Hz.

The examples for speaker 178 (see pitch contours 9-10 to 9-12)
relate to extracts of another French comic actress performing on
stage. Overall range of F0 = c. 200-850 Hz.

The examples for speaker 212 (see pitch contours 9-13 to 9-15)
relate to extracts of the speech of a French actress in a cartoon.
Overall range of F0 = c. 300-700 Hz.
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of the Respective First Formant Frequency 




The examples for the speakers subsumed under the ID number 251
(see pitch contours 9-16 to 9-18) relate to extracts of two British
actresses performing as the voices of the two main characters in
a computer-animated fantasy film. Overall range of F0 = c. 150-800 Hz.

The examples for speaker 276 (see pitch contours 9-19 to 9-21)
relate to extracts of a French comedy actress performing on
stage. Overall range of F0 = c. 400-780 Hz.

The example for speaker 175 (see pitch contour 9-22) relates to
an extract of a North American actress performing as a female
character in a film. Range of F0 = c. 270-700 Hz (excluding one
high-pitched exclamation at F0 of c. 880 Hz).

The example for speaker 223 (see pitch contour 9-23) relates to
an extract of a German actress dubbing a female character in a
film. Range of F0 = c. 220-780 Hz (excluding one high-pitched
exclamation at the end).

The example for speaker 234 (see pitch contour 9-24) relates
to an extract of a French comic actress performing on stage.
Range of F0 = c. 200-850 Hz.

The example for speaker 258 (see pitch contour 9-25) relates
to an extract of a French actress performing as the voice of a
female character in an animation film. Range of F0 = c. 220-780 Hz.

The example for speaker 275 (see pitch contour 9-26) relates
to an extract of a German comic actress performing on stage.
Range of F0 = c. 180-850 Hz.

The example for speaker 291 (see pitch contour 9-27) relates to
an extract of a British actress performing in a fantasy film. Range
of F0 = c. 100-700 Hz.

The example for speaker 296 (see pitch contour 9-28) relates to
an extract of a German comic actress. Range of F0 = c. 150-600 Hz.

The example for speaker 350 (see pitch contour 9-29) relates
to an extract of a North American actress performing as a female
character in a film. Range of F0 = c. 160-900 Hz (excluding
some very high-pitched exclamations).

The example for speaker 398 (see pitch contour 9-30) relates to
an extract of a North American actress performing as a female
character in a TV series. Range of F0 = c. 300-980 Hz.




Figure 10 shows pitch contours of speech extracts of performing ac- 
tors (film, comic, voice-over, dubbing): 


The examples for speaker 225 (see pitch contours 10-1 to 10-4)
relate to speech extracts of a Swiss comic actor performing as
a female character. Overall range of F0 = c. 220-780 Hz.

The examples for speaker 163 (see pitch contours 10-5 to 10-7)
relate to extracts of an Indonesian comic actor performing on
stage in a Drama Gong. Overall range of F0 = c. 300-600 Hz.

The examples for speaker 169 (see pitch contours 10-8 to 10-10)
relate to extracts of a German actor dubbing a male character
in a film. Overall range of F0 = c. 100-700 Hz.

The examples for speaker 214 (see pitch contours 10-11 to 10-13)
relate to extracts of a Japanese Kabuki actor. Overall range
of F0 = c. 250-700 Hz.

The examples for speaker 297 (see pitch contours 10-14 to 10-16)
relate to extracts of speech of another Swiss comic actor
performing in a TV show. Overall range of F0 = c. 130-620 Hz.

The examples for speaker 194 (see pitch contours 10-17 and
10-18) relate to extracts of a French comic actor performing on
stage. Overall range of F0 = c. 130-700 Hz.

The examples for the speakers subsumed under the ID number 394
(see pitch contours 10-19 and 10-20) relate to extracts of two
French actors performing as the voices of male characters in an
animation film. Overall range of F0 = c. 310-650 Hz.

The example for speaker 171 (see pitch contour 10-21) relates
to extracts of speech of a German actor dubbing the voice of a
male character. Range of F0 = c. 180-550 Hz.

The example for speaker 274 (see pitch contour 10-22) relates to
extracts of speech of a Swiss actor performing as ventriloquist.
Range of F0 = c. 120-600 Hz.

The example for speaker 294 (see pitch contour 10-23) relates
to an extract of speech of a North American actor performing as
the voice of a female character in a comedy-variety film. Range
of F0 = c. 200-800 Hz.

The example for speaker 351 (see pitch contour 10-24) relates
to an extract of speech of a German comic actor performing in
a TV show. Range of F0 = c. 150-580 Hz (excluding one
high-pitched exclamation at F0 of c. 780 Hz).


For earlier accounts, see Maurer and Landis (1996, 2000), Maurer, Mok, 
Friedrichs, and Dellwo (2014), Friedrichs, Maurer, and Dellwo (2015), 
Friedrichs, Maurer, Suter, and Dellwo (2015). 
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SPL (dB/Hz) 


Figure 6. Five intelligible sounds of /y, e, ø, ɛ, o/ produced by children and women at F0
in the range of 700-800 Hz.
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SPL (dB/Hz) 


Figure 7. Three intelligible sounds of the corner vowels /i, a, u/ produced by women at
F0 of c. 850 Hz.
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of the Respective First Formant Frequency 


Figure 8. Pitch contours of speech extracts produced by untrained speakers, journalists,
TV hosts and actresses talking on TV (not acting), as can be experienced in everyday life.
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FO range for speaker 135=c.220-600Hz 
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Materials Part III 


(Figure 8, continuation) 
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Figure 9. Pitch contours of extracts of speech produced by actresses while performing 
(film, comic, voice-over, dubbing). 
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FO range for speaker 178=c.200-850Hz 


9-11 [speech] 178-w-A R38680
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(Figure 9, continuation) 
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9-22 [speech] 175-w-A R46869 


FO range for speaker 175=c.270-700Hz 
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(Figure 9, continuation) 
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FO range for speaker 350=c.160-900Hz 
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FO range for speaker 398=c.300-980Hz 
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Figure 10. Pitch contours of extracts of speech produced by actors while performing 
(film, comic, voice-over, dubbing). 
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10-10 [speech] 169-m-A R43753 


FO range for speaker 169=c.100-700Hz 
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FO range for speaker 214=c.250-700Hz 
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(Figure 10, continuation) 
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FO range for speaker 394=c.310-650Hz 
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FO range for speaker 274=c.120-600Hz 
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FO range for speaker 294=c.200-800Hz 
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M8.3 “Inversions” of Relative Spectral Energy Maxima 
and Minima and “Inverse” Formant Patterns in Sounds 
of Individual Vowels 


For each of the vowels /a-a, o, u/ and for each speaker group,
Figures 11 to 13 show pairs of sounds produced at different fundamental
frequencies exhibiting “inverse” relative spectral maxima and minima
in terms of “inverse” spectral envelope curves < 1.5 kHz: whereas a
relative minimum in the spectral envelope occurs for one sound of a
pair, a peak is manifest for the other sound, and vice versa; however,
the perceived vowel quality is maintained. The same holds true for
comparisons of the respective calculated filter curves and, for most
cases, for comparisons of patterns of manifest formants.




Figure 11. Sounds of /a-a/, produced at different F0 by children, women and men,
which exhibit “inverse” relative spectral maxima and minima in terms of “inverse”
spectral envelope curves < 1.5 kHz.
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Figure 12. Sounds of /o/, produced at different FO by children, women and men, which 
exhibit “inverse” relative spectral maxima and minima in terms of “inverse” spectral
envelope curves < 1.5 kHz.
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“Inverse” Formant Patterns in Sounds of Individual Vowels 


Figure 13. Sounds of /u/, produced at different F0 by children, women and men, which
exhibit “inverse” relative spectral maxima and minima in terms of “inverse” spectral
envelope curves < 1.5 kHz.
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M9 Ambiguous Correspondence between 
Vowels and Patterns of Relative Spectral 
Energy Maxima or Formant Patterns 
or Complete Spectral Envelopes 


M9.1 Ambiguous Patterns of Relative Spectral Energy Maxima 
and Ambiguous Formant Patterns 


Figures 1 to 21 show series of sounds of different vowels produced at
different F0 but exhibiting similar patterns of relative spectral energy
maxima and/or similar patterns of calculated formant frequencies within
their supposed vowel-specific frequency range related to statistical
F1 and F2. In all cases, the actual differences of the patterns for the
sounds of different vowels presented in a single series are far smaller
than the observable differences (variations) of corresponding patterns
for sounds of a single vowel. In some series that include sounds at
high fundamental frequencies, the overall spectral envelopes and the
harmonic spectra are considered for the comparison in question.
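
The criterion stated above, that pattern differences between sounds of different vowels in a series are smaller than the pattern variation among sounds of a single vowel, can be expressed as a simple comparison. The following is a minimal sketch under stated assumptions: the distance measure (largest absolute difference in Hz across F1 and F2) is a choice made here for illustration, and the numerical patterns in the usage lines are invented placeholders, not measurements from the figures.

```python
from itertools import combinations


def pattern_distance(p, q):
    """Largest absolute difference in Hz between two (F1, F2) patterns."""
    return max(abs(a - b) for a, b in zip(p, q))


def max_spread(patterns):
    """Largest pairwise distance within a list of (F1, F2) patterns."""
    return max(pattern_distance(p, q) for p, q in combinations(patterns, 2))


def is_ambiguous(series_patterns, single_vowel_patterns):
    """True if sounds of different vowels in a series differ less from one
    another than sounds of a single vowel differ among themselves."""
    return max_spread(series_patterns) < max_spread(single_vowel_patterns)


# Invented placeholder values (Hz), for illustration only.
series = [(600, 1200), (620, 1180), (590, 1230)]  # sounds of different vowels
single_vowel = [(350, 700), (800, 1250)]          # two sounds of one vowel
print(is_ambiguous(series, single_vowel))
```

The text speaks of differences that are "far smaller"; the sketch only tests for "smaller", and a stricter margin could be added where needed.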


For each series, roughly estimated average frequencies of the two lower
relative spectral energy maxima and/or of the calculated frequencies
F1-F2 are given below in terms of model patterns for the sounds
compared. Exceptions concern a few comparisons of sounds of back
vowels, for which only a single spectral peak is manifest in the sound
spectra (for these comparisons, the corresponding peak frequency is
given), and an additional exception concerns a comparison of sounds of
/a-a, u/, for which only the spectrum as such < 1.5 kHz is considered.


The first sound series shown include sounds of the vowels /a—a, o, u/, 
divided into two groups, one presenting sounds of different speak- 
ers, the other presenting sounds of single speakers. The second se- 
ries shown include sounds of front vowels, again divided into the two 
groups mentioned. (Figures 9 and 11 include exceptions that illustrate 
the ambiguity discussed for sounds of different and of single speak- 
ers.) Within a series, the sounds are organised according to fundamen- 
tal frequency. 


Comparisons of sounds of back vowels and of /a—a/ produced by dif- 
ferent speakers: 
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Figure 1 Sounds of /a-a, o, u/; model pattern of spectral peaks 

and/or of calculated formant frequencies = 600-1200 Hz

Figure 2 Sounds of /a-a, o, u/; model pattern of spectral peaks
and/or of calculated formant frequencies = 600-1050 Hz

Figure 3 Sounds of /a-a, o/; model pattern of spectral peaks and/
or of calculated formant frequencies = 660-1320 Hz


Sounds of /u/ are included in the first three series because the first
harmonic corresponds to F1 of the model pattern in question; however,
no clear spectral indication can be found for F2 even if LPC analysis
gives a (weak) second formant at a frequency level which corresponds
to the model pattern of a series.


Comparisons of sounds of back vowels and of /a-a/ produced by sin- 
gle speakers: 


Figure 4 Three comparisons of sounds of /a—a, o, u/ produced by 
a man and two women; model pattern of spectral peaks 
and/or of calculated formant frequencies = 600-1200 Hz 

Figure 5 Two comparisons of sounds of /a-a, o/ produced by a 
man (sounds sung by a tenor); model pattern of spectral 
peaks and/or of calculated formant frequencies = 600- 
1200Hz for the first comparison, similar spectral peaks and 
spectral envelopes for the second comparison 

Figure 6 Sounds of /a-a/ and of /u/ produced by a woman which 
exhibit comparable spectral envelopes < 1.5 kHz 

Figure 7 Sounds of /ɔ, o, u/ produced by a woman; model pattern
of spectral peaks and/or of calculated formant frequencies
= one clear peak at c. 550 Hz (exceptionally, sounds
of the vowel /ɔ/ are included in order to show a possible
shift in perceived vowel quality from /ɔ/ to /o/ related to
two levels of F0 of c. 175 Hz and c. 260 Hz)

Figure 8 Two comparisons of sounds of /o, u/ produced by two 
children (age 12 and 6); model patterns of spectral peaks 
and/or of calculated formant frequencies = one clear peak 
at c. 400 Hz (first sound pair) and at c. 520Hz (second 
sound pair), respectively. 


Comparisons of sounds of front vowels produced by different speakers: 


In contrast to many other comparisons presented in this chapter, the
ambiguity illustrated in Figures 9 to 11 does not always relate to
substantial differences in F0 but also to the configuration of the
levels of the harmonics, to the spectrum above F2 and to the levels of
calculated formants including F3. This is the case particularly for
direct comparisons of sounds of /e/ and of /ø/, and of /i/ and of /y/,
respectively. Moreover, a sound produced with creak phonation is
exceptionally included in the comparison (see the first vowel spectrum
of Figure 9).


Figure 9 Sounds of /ø, e, y, i/; model pattern of spectral peaks
and/or of calculated formant frequencies = 330-2000 Hz;
note the ambiguity for sounds of /ø, i/ for the single
speaker 391 and the ambiguity for the sounds of /ø, y/ for the
single speaker 376

Figure 10 Sounds of /ø, e, y, i/; model pattern of spectral peaks
and/or of calculated formant frequencies = 350-2150 Hz

Figure 11 Sounds of /ø, e, y, i/; model pattern of spectral peaks and/
or of calculated formant frequencies = 420-2150 Hz; note
the ambiguity for sounds of /ø, y/ for the single speaker 402

Figure 12 Sounds of /ɛ, e, i/; model pattern of spectral peaks and/
or of calculated formant frequencies = 500-2250 Hz

Figure 13 Sounds of /ɛ, e, i/; model pattern of spectral peaks and/
or of calculated formant frequencies = 600-2450 Hz

Figure 14 Sounds of /e, i/; model pattern of spectral peaks and/
or of calculated formant frequencies = 400-2600 Hz

Figure 15 Sounds of /ɛ, e, y/; model pattern of spectral peaks and/
or of calculated formant frequencies = 500-2000 Hz

Figure 16 Sounds of /e, ø, y/; model pattern of spectral peaks and/
or of calculated formant frequencies = 430-2000 Hz

Figure 17 Sounds of /e, ø, y/; model pattern of spectral peaks and/
or of calculated formant frequencies = 475-1900 Hz

Figure 18 Sounds of /e, y/; model pattern of spectral peaks and/
or of calculated formant frequencies = 650-1950 Hz


Comparisons of sounds of front vowels, produced by single speakers: 


Figure 19 Two comparisons of sounds of /ɛ, e, i/ produced by two
women; model patterns of spectral peaks and/or of calculated
formant frequencies = 510-2550 Hz and 600-2400 Hz,
respectively

Figure 20 Three comparisons of sounds of /e, i/ produced by three
children (age 7 to 9); model patterns of spectral peaks and/
or of calculated formant frequencies = 450-3000 Hz and
400-3000 Hz, respectively

Figure 21 Three comparisons of sounds of /ø, y/ produced by a man,
a woman and a child (age 12); model patterns of spectral
peaks and/or of calculated formant frequencies = 320-1600 Hz,
320-2000 Hz and 400-2000 Hz, respectively


For earlier accounts, see Maurer and Landis (2000). 
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Sound Pressure Level (dB/Hz) 


Figure 1. Sounds of /a-a, o, u/ produced by different speakers; related model pattern of 
spectral peaks and/or of calculated formant frequencies = 600-1200 Hz. 
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Figure 2. Sounds of /a-a, o, u/ produced by different speakers; related model pattern of 
spectral peaks and/or of calculated formant frequencies = 600-1050 Hz. 
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Figure 3. Sounds of /a-a, o, u/ produced by different speakers; related model pattern of 
spectral peaks and/or of calculated formant frequencies = 660-1320 Hz. 
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Figure 4. Three comparisons of sounds of /a—a, o, u/ produced by a man and two wom- 
en; related model pattern of spectral peaks and/or of calculated formant frequencies = 


600-1200Hz. 
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Figure 5. Two comparisons of sounds of /a—a, o/ produced by a man (Sounds sung by a 
tenor); related model pattern of spectral peaks and/or of calculated formant frequencies 
= 600-1200 Hz for the first comparison; similar spectral peaks and spectral envelopes 
for the second comparison. 
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Figure 6. Sounds of /a-a/ and of /u/, produced by a woman, which exhibit comparable 
spectral envelopes<1.5kHz. 
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Figure 7. Sounds of /9, 0, u/ produced by a woman; related model pattern of spectral 
peaks and/or of calculated formant frequencies = one clear peak at c.550 Hz. 
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Figure 8. Two comparisons of sounds of /o, u/ produced by two children (age 12 and 
6); related model patterns of spectral peaks and/or of calculated formant frequencies 
= one clear peak at c.400Hz (first sound pair) and at c.520Hz (second sound pair), 


respectively. 
Figure 9. Sounds of /ø, e, y, i/ produced by different speakers; model pattern of spectral
peaks and/or of calculated formant frequencies = 330-2000 Hz.

Figure 10. Sounds of /ø, e, y, i/ produced by different speakers; model pattern of
spectral peaks and/or of calculated formant frequencies = 350-2150 Hz.

Figure 11. Sounds of /ø, e, y, i/ produced by different speakers; model pattern of
spectral peaks and/or of calculated formant frequencies = 420-2150 Hz.

Figure 12. Sounds of /ɛ, e, i/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 500-2250 Hz.

Figure 13. Sounds of /ɛ, e, i/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 600-2450 Hz.

Figure 14. Sounds of /e, i/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 400-2600 Hz.

Figure 15. Sounds of /e, ø, y/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 500-2000 Hz.

Figure 16. Sounds of /e, ø, y/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 430-2000 Hz.

Figure 17. Sounds of /e, ø, y/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 475-1900 Hz.

Figure 18. Sounds of /e, y/ produced by different speakers; related model pattern of
spectral peaks and/or of calculated formant frequencies = 650-1950 Hz.

Figure 19. Two comparisons of sounds of /ɛ, e, i/ produced by two women; related
model patterns of spectral peaks and/or of calculated formant frequencies = 510-2550 Hz
and 600-2400 Hz, respectively.

Figure 20. Three comparisons of sounds of /e, i/ produced by three children (age range
7 to 9); related model patterns of spectral peaks and/or of calculated formant frequencies
= 450-3000 Hz and 400-3000 Hz, respectively.

Figure 21. Three comparisons of sounds of /ø, y/ produced by a man, a woman and
a child (age 12); related model patterns of spectral peaks and/or of calculated formant
frequencies = 320-1600 Hz, 320-2000 Hz, and 400-2000 Hz, respectively.

M9.2 Ambiguous Spectral Envelopes


For the frequency range relevant to the perceived vowel qualities in
question, many of the sound series presented in the previous chapter
show, for sounds of different vowels, not only similar patterns of
vowel-related spectral peaks and of calculated F1-F2 but also similar
vowel-related spectral envelope shapes, including similar patterns of
calculated F1-F2-F3 for sounds of front vowels (for all calculated
formant frequencies, refer to the online digital version of the
Materials).


M9.3 Ambiguity and Individual Vowels 


The series in Section M9.1 present ambiguities as discussed here for 
all combinations of the long German back vowels and /a—a/ and for all 
combinations of the long German front vowels. Thus, the ambiguities 
are not a phenomenon of overlapping F1-F2 spaces of neighbouring 
vowel qualities but, in most cases, a consequence of the depend- 
ence of vowel-specific, relative spectral energy maxima and lower 
formants < 1.5kHz on fundamental frequency, interrelated with an ob- 
servable variation of higher vowel-related spectral parts for sounds of 
front vowels. 


However, two restrictions apply. 


Concerning the sounds of back vowels and of /a-a/ investigated, the
demonstration of a possible ambiguity of the lower spectral envelope
and of F1-F2 is unquestionable for comparisons of sounds of /u/ and
of /o/, and of /o/ and of /a-a/. For the comparison of sounds of /u/ and
of /a-a/, however, the demonstration of a possible ambiguity is limited
to similar calculated F1-F2, and because of the high F0 of the sounds
of /u/, this calculation is methodologically unsubstantiated. Furthermore,
direct comparison of the spectral envelope and the configuration of the
levels of the harmonics generally provides no clear indication.
Nevertheless, it is important to consider that sounds of /u/ can be
produced at a level of F0 that corresponds to F1 of sounds of /a/ and
that, in such cases, they exhibit a dominant first harmonic.


Concerning the sounds of front vowels investigated, the demonstration
of a possible ambiguity that is related to differences in F0 of the
sounds compared does not concern the direct comparisons of sounds
of /e, ø/ and of /i, y/. As mentioned, in such cases, the ambiguity
relates to the configuration of the levels of the harmonics, to the
spectrum above F2 and to the levels of calculated formants including F3.
This phenomenon is again illustrated in the following three figures.
Figure 22 Three sound pairs of /y, i/, each pair produced by single
female speakers; model patterns of spectral peaks and/
or of calculated formant frequencies = 290-2150 Hz,
315-2100 Hz and 350-2100 Hz, respectively

Figure 23 Sounds of /y, i/ produced by different male speakers; model
pattern of spectral peaks and/or of calculated formant
frequencies = 230-2050 Hz

Figure 24 A sound pair of /ø, e/ produced by a single male speaker;
model pattern of spectral peaks and/or of calculated
formant frequencies = 350-1700 Hz

Figure 22. Three sound pairs of /y, i/, each pair produced by single female speakers;
model patterns of spectral peaks and/or of calculated formant frequencies = 290-2150 Hz,
315-2100 Hz and 350-2100 Hz, respectively.

Figure 23. Sounds of /y, i/ produced by different male speakers; model pattern of
spectral peaks and/or of calculated formant frequencies = 230-2050 Hz.

Figure 24. A sound pair of /ø, e/ produced by a single male speaker; model pattern of
spectral peaks and/or of calculated formant frequencies = 350-1700 Hz.
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M10 Lack of Correspondence between 
Patterns of Relative Spectral Energy 
Maxima or Formant Patterns and Age- 
and Gender-Related Speaker Groups 
or Vocal-Tract Sizes 


M10.1 Similar Patterns of Relative Spectral Maxima and Similar 
Formant Patterns < 1.5 kHz for Different Age- and 
Gender-Related Speaker Groups or Vocal-Tract Sizes 


Figure 1 shows sounds of the vowel /o/ produced by a child (age 8), a
woman and a man. Each speaker produced sounds at different F0 in a
way that allowed for a comparison of the sounds of the three speakers
(representing the three main speaker groups according to age and
gender) at different and similar F0. The comparison shows that age- and
gender-related differences < 1.5 kHz as given in formant statistics for
citation-form words can decrease or even disappear if F0 of the
vocalisations correspond for children, women and men. In this regard,
comparisons of vocalisations of /o/ are of special interest (and shown
first) because an F0-dependence of the lower spectral frequency range
can be observed for F0 clearly below statistical F1, and because the
frequency range < 1.5 kHz covers the entire range related to the vowel
identity in question. Data for speakers, ranges of F0 and calculated
F1 and F2:


Spectra 1-1 to 1-6 Child; F0 = 196-322 Hz, F1 = 424-624 Hz,
F2 = 777-1092 Hz

Spectra 1-7 to 1-13 Woman; F0 = 162-320 Hz, F1 = 363-576 Hz,
F2 = 804-1141 Hz

Spectra 1-14 to 1-21 Man; F0 = 129-326 Hz, F1 = 343-577 Hz,
F2 = 672-1143 Hz
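
As a rough illustration of how the ranges listed above overlap, the following Python sketch computes the F1 interval shared by all three speakers. It uses only the F1 values given for Figure 1; the names `f1_ranges` and `common_range` are introduced here for illustration and are not part of the original analysis.

```python
# F1 ranges in Hz per speaker, copied from the data listed for Figure 1.
f1_ranges = {
    "child": (424, 624),
    "woman": (363, 576),
    "man": (343, 577),
}


def common_range(ranges):
    """Return the (low, high) interval shared by all ranges, or None if there is none."""
    ranges = list(ranges)
    low = max(lo for lo, _ in ranges)
    high = min(hi for _, hi in ranges)
    return (low, high) if low <= high else None


shared_f1 = common_range(f1_ranges.values())
print(shared_f1)
```

Because each range spans sounds produced at different F0, such an overlap is only a coarse summary; the spectra themselves remain the primary evidence.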


Figure 2 demonstrates this phenomenon for sounds of the vowel /e/
produced by a child (age 10), a woman and a man, concerning the
lowest spectral peak and F1. Data for speakers, ranges of F0 and
calculated F1:


Spectra 2-1 to 2-6 Child; F0 = 180-330 Hz, F1 = 395-563 Hz

Spectra 2-7 to 2-13 Woman; F0 = 160-325 Hz, F1 = 389-622 Hz

Spectra 2-14 to 2-21 Man; F0 = 122-336 Hz, F1 = 370-566 Hz (excluding
the last sound, for which automatic calculation
of F1 does not provide a reliable result)


Similar indications as shown for sounds of /e/ can be found for sounds
of /ø/.


Figure 3 demonstrates this phenomenon for sounds of the vowel /u/
produced by a child (age 8), a woman and a man. However, only the
first lower peak and calculated F1 are discussed because, for several
sounds, an interpretation of F2 lacks methodological substantiation.
Data for speakers, ranges of F0 and calculated F1:


Spectra 3-1 to 3-6 Child; F0 = 237-492 Hz, F1 = 273-492 Hz
Spectra 3-7 to 3-13 Woman; F0 = 177-498 Hz, F1 = 300-502 Hz
Spectra 3-14 to 3-21 Man; F0 = 138-519 Hz, F1 = 303-519 Hz


Figure 4 demonstrates this phenomenon for sounds of the vowel /i/
produced by a child (age 8), a woman and a man, concerning the lower
spectral peak and calculated F1. Data for speakers, ranges of F0
and calculated F1:


Spectra 4-1 to 4-6 Child; F0 = 247-533 Hz, F1 = 267-534 Hz
Spectra 4-7 to 4-13 Woman; F0 = 177-518 Hz, F1 = 279-525 Hz
Spectra 4-14 to 4-21 Man; F0 = 134-534 Hz, F1 = 216-550 Hz


Similar indications as shown for sounds of /i/ can be found for sounds
of /y/.


With regard to sounds of /a-a/, a compilation of corresponding sound
series similar to those presented for the other vowels often encounters
some difficulties for two main reasons: spectral peaks and formant
patterns often do not shift markedly with rising F0, and children often
produce a very open /a/, while many adults produce an intermediate
sound of /a-a/ or even a sound of /a/, although all speakers speak
the same language and live in a geographically limited area. However,
Figure 5 demonstrates a case of comparable vowel spectra and
comparable formant patterns for sounds of /a/ produced by a child
(age 10), a woman and a man. Data for speakers, ranges of F0 and
calculated F1 and F2:


Spectra 5-1 to 5-6 Child; F0 = 196-329 Hz, F1 = 759-1055 Hz,
F2 = 1341-1555 Hz

Spectra 5-7 to 5-13 Woman; F0 = 160-329 Hz, F1 = 706-1007 Hz,
F2 = 1265-1503 Hz

Spectra 5-14 to 5-21 Man; F0 = 126-324 Hz, F1 = 758-898 Hz,
F2 = 1232-1431 Hz


The sounds presented in the previous figures may lead to the question
whether, with rising F0 and related shifts of the lower spectral peaks
and of the calculated lower formants, the perception of age and gender
of the speaker alters, i.e. whether the sounds of adults are perceived
as produced by children at F0 > c. 260 Hz, and whether sounds of men
are perceived as produced by women at F0 > c. 200 Hz. This may indeed
be the case for the comparison of the sounds of some speakers, while it
does not hold true for others. To demonstrate the latter, Figure 6 shows
similar vowel spectra and similar formant patterns for sounds of the
vowel /o/ produced by a child (age 10), a woman (untrained speaker)
and a man (classical opera singer, baritone). For these sounds, the
perceived vowel quality corresponds very well. However, the baritone is
always perceived as such at all F0 of his singing, which is represented
in his vowel spectra by a so-called “singer’s formant cluster”. (Again,
only the first lower peak and calculated F1 are discussed since most
sounds exhibit only one spectral peak; for these sounds, the calculated
F2 is weak and its role for vowel perception is questionable; see
Section M7.1.) Data for speakers, ranges of F0 and calculated F1:


Spectra 6-1 to 6-5 Child; F0 = 181-348 Hz, F1 = 377-674 Hz
Spectra 6-6 to 6-11 Woman; F0 = 168-332 Hz, F1 = 344-593 Hz
Spectra 6-12 to 6-17 Man; F0 = 127-325 Hz, F1 = 386-680 Hz


As a direct consequence of the documented observations, it follows 
that, for back vowels, the sounds of men (at higher F0) may exhibit
higher vowel-related spectral peaks and higher calculated F1 or F1-F2 
patterns than the sounds of women (at lower F0). The same holds true 
for the lowest spectral peak and calculated F1 of front vowels and may 
also occur when comparing sounds of adults and children. 
Figure 7 shows such an “inversion” of expected age- and gender-related
differences by comparing sounds of the vowel /o/ produced by a child
and a man, selected from the sound series of the previous Figure 6. If
the F0 of the sounds of the man substantially exceeds the F0 of a
sound of the child, the first spectral peak and calculated F1 of the
sounds of the man are also above the corresponding peak and F1 of
the sound of the child (compare Spectra 7-1 to 7-3). The same holds
true for calculated F2, but as mentioned, the measurement and perceptual
role of F2 are in question. However, if the comparison relates to the
sounds of the man at F0 corresponding to statistical values (given for
citation-form words), the first spectral peak and calculated F1 (and F2)
are found to be lower for the man than for the child, as is generally
expected (see Spectra 7-4 and 7-5). Data for speakers, ranges of F0
and calculated F1 (and F2), in the order of F0:


“Inverted” age- or size-related difference
Spectra 7-1 Child; F0 = 223 Hz, F1 = 440 Hz (F2 = 764 Hz)
Spectra 7-2, 7-3 Man; F0 = 261-325 Hz, F1 = 511-680 Hz (F2 = 884-950 Hz)

“Expected” age- or size-related difference
Spectra 7-4 Man; F0 = 127 Hz, F1 = 430 Hz (F2 = 535 Hz)
Spectra 7-5 Child; F0 = 264 Hz, F1 = 538 Hz (F2 = 1069 Hz)
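
The distinction between "inverted" and "expected" differences can be stated as a simple comparison of calculated F1 values. The following Python sketch applies it to the F1 values listed for Figure 7; the function name and labels are illustrative assumptions, and for Spectra 7-2 and 7-3 only the lower end of the listed F1 range is used, since the source gives a range rather than per-spectrum values.

```python
def f1_difference(man_f1_hz, child_f1_hz):
    """Label a man/child comparison by the sign of the F1 difference."""
    if man_f1_hz > child_f1_hz:
        return "inverted"   # man's F1 above the child's F1
    if man_f1_hz < child_f1_hz:
        return "expected"   # man's F1 below the child's F1
    return "equal"


# F1 values in Hz for /o/, taken from the Figure 7 data.
child_f1_7_1 = 440                 # Spectrum 7-1, child at F0 = 223 Hz
man_f1_range_7_2_3 = (511, 680)    # Spectra 7-2, 7-3, man at F0 = 261-325 Hz
man_f1_7_4 = 430                   # Spectrum 7-4, man at F0 = 127 Hz
child_f1_7_5 = 538                 # Spectrum 7-5, child at F0 = 264 Hz

# Even the lowest F1 of the man's high-F0 sounds exceeds the child's F1.
high_f0_case = f1_difference(min(man_f1_range_7_2_3), child_f1_7_1)
low_f0_case = f1_difference(man_f1_7_4, child_f1_7_5)
print(high_f0_case, low_f0_case)
```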


Figure 8 demonstrates this phenomenon < 1.5 kHz by comparing selected
sounds of the vowel /e/ shown in Figure 2. Data for speakers
and ranges of F0 and calculated F1:


“Inverted” age- or size-related difference
Spectra 8-1 Child; F0 = 222 Hz, F1 = 449 Hz
Spectra 8-2, 8-3 Man; F0 = 260-293 Hz, F1 = 506-566 Hz

“Expected” age- or size-related difference
Spectra 8-4 Man; F0 = 122 Hz, F1 = 370 Hz
Spectra 8-5 Child; F0 = 265 Hz, F1 = 518 Hz


Figure 9 demonstrates this phenomenon < 1.5 kHz by comparing selected
sounds of the vowel /u/ shown in Figure 3. Data for speakers
and ranges of F0 and calculated F1:


“Inverted” age- or size-related difference
Spectra 9-1 Child; F0 = 237 Hz, F1 = 273 Hz
Spectra 9-2, 9-3 Man; F0 = 410-519 Hz, F1 = 412-519 Hz

“Expected” age- or size-related difference
Spectra 9-4 Man; F0 = 138 Hz, F1 = 303 Hz
Spectra 9-5 Child; F0 = 257 Hz, F1 = 346 Hz


Figure 10 demonstrates this phenomenon < 1.5 kHz by comparing selected
sounds of the vowel /i/ shown in Figure 4. Data for speakers
and ranges of F0 and calculated F1:


“Inverted” age- or size-related difference
Spectra 10-1 Child; F0 = 247 Hz, F1 = 267 Hz
Spectra 10-2, 10-3 Man; F0 = 441-534 Hz, F1 = 444-550 Hz

“Expected” age- or size-related difference
Spectra 10-4 Man; F0 = 134 Hz, F1 = 269 Hz
Spectra 10-5 Child; F0 = 263 Hz, F1 = 301 Hz


Comparisons are limited to children and men because the correspond- 
ing differences in the vocal-tract sizes are assumed to be highest. 


For earlier accounts, see Maurer, Cook, Landis, and d’Heureuse (1992), 
Maurer, Suter, Friedrichs, and Dellwo (2015b); note also some related 
reflections in Potter and Steinberg (1950). 
Figure 1. Sounds of /o/ produced by a child, a woman and a man at comparable levels
of F0. Fig. 1-1 to 1-6 = sounds of the child; Fig. 1-7 to 1-13 = sounds of the woman; Fig.
1-14 to 1-21 = sounds of the man.

Figure 2. Sounds of /e/ produced by a child, a woman and a man at comparable levels
of F0. Fig. 2-1 to 2-6 = sounds of the child; Fig. 2-7 to 2-13 = sounds of the woman; Fig.
2-14 to 2-21 = sounds of the man.

Figure 3. Sounds of /u/ produced by a child, a woman and a man at corresponding F0.
Fig. 3-1 to 3-6 = sounds of the child; Fig. 3-7 to 3-13 = sounds of the woman; Fig. 3-14
to 3-21 = sounds of the man.

Figure 4. Sounds of /i/ produced by a child, a woman and a man at comparable levels
of F0. Fig. 4-1 to 4-6 = sounds of the child; Fig. 4-7 to 4-13 = sounds of the woman; Fig.
4-14 to 4-21 = sounds of the man.

Figure 5. Sounds of /a/ produced by a child, a woman and a man at comparable levels
of F0. Fig. 5-1 to 5-6 = sounds of the child; Fig. 5-7 to 5-13 = sounds of the woman; Fig.
5-14 to 5-21 = sounds of the man.

Figure 6. Sounds of /o/ produced by a child, a woman (untrained speaker) and a man
(professional opera singer, baritone) at comparable levels of F0. Fig. 6-1 to 6-5 = sounds
of the child; Fig. 6-6 to 6-11 = sounds of the woman; Fig. 6-12 to 6-17 = sounds of the man.

Figure 7. “Inverted” age- or size-related differences in the vowel-related lower spectral
peak(s) and calculated F1 (and F2) for sounds of /o/ produced by a child and a man (see
Fig. 7-1 to 7-3), and “expected” age- or size-related differences (see Fig. 7-4 and 7-5).
Comparison of selected sounds of Figure 6.

Figure 8. “Inverted” age- or size-related differences in the vowel-related lower spectral
peak and calculated F1 for sounds of /e/ produced by a child and a man (see Fig. 8-1 to
8-3), and “expected” age- or size-related differences (see Fig. 8-4 and 8-5). Comparison
of selected sounds of Figure 2.

Figure 9. “Inverted” age- or size-related differences in the vowel-related lower spectral
peak(s) and calculated F1 (and F2) for sounds of /u/ produced by a child and a man (see
Fig. 9-1 to 9-3), and “expected” age- or size-related differences (see Fig. 9-4 and 9-5).
Comparison of selected sounds of Figure 3.

Figure 10. “Inverted” age- or size-related differences in the vowel-related lower spectral
peak and calculated F1 for sounds of /i/ produced by a child and a man (see Fig. 10-1 to
10-3) and “expected” age- or size-related differences (see Fig. 10-4 and 10-5).
Comparison of selected sounds of Figure 4.
M10.2 The Dichotomy of the Vowel Spectrum


In Chapter 10.1, we have argued that the spectrum of a vowel sound
needs a twofold rather than a uniform consideration, because only the
vowel-related spectrum < 1.5 kHz clearly depends on F0 and, therefore,
is not generally specific to speaker groups and vocal-tract sizes.
Figures 7 to 10 in the previous chapter illustrate this dichotomy of the
vowel spectrum.


M10.A Addition: Vowel Imitations by Birds 


The following series show examples of vowel sounds of common hill 
mynah birds (Gracula religiosa) imitating vocal expressions and words 
of humans. The examples are selected on the basis of extensive re- 
cordings of 21 birds, most of them living in Indonesia. (However, they 
imitated words of different languages.) The spectra presented relate to 
vowel nuclei extracted from the expressions or words. Both the entire 
imitated expressions or words as well as the extracted sound frag- 
ments are perceptually recognisable. 


In each of the series, the sound spectra are given in the order of the
birds and of F0. (Note that in several cases, different sound spectra for
the same vowel are shown for a bird, in order to document variations
in F0 and the sound spectra.) Acoustic analysis corresponds to the
analysis as described in the Note on the Method section. LPC filter
curves relate to a parameter setting of the LPC analysis according to
the PRAAT standard for women. However, as mentioned in the text,
the LPC analysis is not methodologically substantiated.


Figure 11 Examples of sounds of imitated /i/ in word context produced
by five birds, with F0 ranging from c. 110-380 Hz;
perceptual vowel quality is /i/, including intermediate
qualities /i-j/, /i-y/ and /i-e/

Figure 12 Examples of sounds of imitated /e/ in word context produced
by five birds, with F0 ranging from c. 160-330 Hz;
perceptual vowel quality is /e/, including intermediate
qualities /e-i/ and /e-ø/

Figure 13 Examples of sounds of imitated /a/ in word context produced
by twelve birds, with F0 ranging from c. 110-490 Hz;
perceptual vowel quality is /a-a/, including intermediate
quality /a-

Figure 14 Examples of sounds of imitated /o/ in word context produced
by eleven birds, with F0 ranging from c. 80-410 Hz;
perceptual vowel quality is /o/, including intermediate
quality /o-ɔ/

Figure 15 Examples of sounds of imitated /u/ in word context produced
by seven birds, with F0 ranging from c. 110-660 Hz;
perceptual vowel quality is /u/, including intermediate
quality /u-o/


Note that many of the sound spectra of these birds are similar to the 
vowel spectra of humans presented in the previous sections. However, 
for some examples of imitations of front vowels, the lower part of the 
spectral configuration < 1 kHz is “unexpected”. 
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Figure 11. Sounds of /i/ in word context imitated by mynah birds. 
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Figure 12. Sounds of /e/ in word context imitated by mynah birds. 
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Figure 13. Sounds of /a-a/ in word context imitated by mynah birds. 
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Figure 14. Sounds of /o/ in word context imitated by mynah birds. 
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Figure 15. Sounds of /u/ in word context imitated by mynah birds. 
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M11 Lack of Correlation between Metho- 
dological Limitations of Formant 
Determination and Limitations of Vowel 
Perception 


M11.1 Vowel Perception at Fundamental Frequencies > 350 Hz 


The sound series presented in Sections M8.1 and M8.2 demonstrate 
that recognisable vowels can be produced at fundamental frequencies 
substantially exceeding the critical limit above which formants can no 
longer be reliably determined for methodological reasons. 
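
The methodological limit can be illustrated with simple arithmetic: the harmonics of a voiced sound lie at integer multiples of F0, so the number of harmonics available to outline the spectral envelope below a given frequency falls as F0 rises. A minimal Python sketch, assuming a limit of 1.5 kHz and a handful of F0 values chosen here purely for illustration (the function name is hypothetical):

```python
import math


def harmonics_below(f0_hz, limit_hz=1500.0):
    """Count the harmonics of f0_hz that lie at or below limit_hz."""
    return math.floor(limit_hz / f0_hz)


# Illustrative F0 values in Hz; they are not taken from the sound series.
for f0 in (130, 260, 350, 500, 700):
    print(f0, harmonics_below(f0))
```

With only a few harmonics in the range in question, an envelope and hence a formant pattern can no longer be inferred reliably from the spectrum, even though, as the sound series show, the vowel may remain recognisable.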


M11.2 Lack of Correspondence between Methodological 
Problems of Formant Pattern Estimation at Fundamental 
Frequencies < 350 Hz and Impaired Vowel Perception 


The sound series presented in Sections M7.1 and M7.2 demonstrate
that vowel sounds produced at fundamental frequencies < 350 Hz, for
which the estimation of formant patterns proves questionable for
reasons other than fundamental frequency (for instance, if expected
relative spectral energy maxima are “missing” or if vowel-related parts
of a spectrum are “flat”), are not less recognisable than vowel
sounds for which formant pattern estimation may be said to be
unproblematic.
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Experiments 


The treatise concludes with a list of possible experiments that
allow for empirical exploration of the problems discussed here under
laboratory conditions.
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E1 Number of Relative Spectral Energy 
Maxima and Number of Formants 


E1.1 Sounds of Back Vowels Showing only One Lower Spectral 
Peak <1.5kHz 


To do: (i) Find examples of sounds of back vowels, produced as voiced 
sounds in isolation, which show only one spectral peak <1.5 kHz. (ii) 
Perform a listening test. 


Note: For most of the corresponding examples, LPC analysis yields 
two formants < 1.5 kHz; however, you will find that the second formant 
is often weak (large second formant bandwidth, low second formant 
level). You also will find examples for which LPC analysis yields only 
one lower formant frequency. (Long vowels produced in some lan- 
guages, such as Standard German, are particularly suited for such an 
experiment.) 


Option: You may also perform resynthesis and perform a related sec- 
ond listening test. 


Thesis: You will find many examples for which the vowel identification 
score is high. 


Examples: See Section M7.1, Figures 1 to 3. 
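
Step (i) of this experiment can be prepared with a simple screening of harmonic spectra. The following Python sketch counts the harmonics that form a local level maximum below 1.5 kHz; it assumes that harmonic levels in dB have already been measured, and the function name and the example levels are hypothetical. Note that a dominant first harmonic is counted as a peak here, which may or may not be desired.

```python
def spectral_peaks_below(f0_hz, harmonic_levels_db, limit_hz=1500.0):
    """Return the frequencies of harmonics that are local level maxima below limit_hz.

    harmonic_levels_db[k] is the level of harmonic k + 1 (the fundamental is k = 0).
    """
    levels = harmonic_levels_db
    peaks = []
    for k, level in enumerate(levels):
        freq = f0_hz * (k + 1)
        if freq >= limit_hz:
            break
        left = levels[k - 1] if k > 0 else float("-inf")
        right = levels[k + 1] if k + 1 < len(levels) else float("-inf")
        if level > left and level > right:
            peaks.append(freq)
    return peaks


# Hypothetical harmonic levels (dB) for a sound at F0 = 200 Hz.
example_levels = [40, 48, 55, 50, 42, 36, 30, 25]
example_peaks = spectral_peaks_below(200, example_levels)
print(example_peaks)
```

A sound qualifies for the experiment if the returned list has exactly one entry; the listening test then decides whether its vowel quality is recognised.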


E1.2 Sounds of Back Vowels Showing only One Pronounced 
Lower Formant<1.5kHz 


To do: (i) From the sample investigated in the previous experiment, se- 
lect examples of sounds of back vowels for which LPC analysis gives 
a weak second formant (high bandwidth, low level). (ii) Manipulate 
these sounds in terms of shaping the spectrum using bandpass filter- 
ing including filter slope variation, until LPC analysis gives only one 
formant < 1.5 kHz. (iii) Perform a listening test. 


Thesis: You will find examples for which the perceived vowel quality 
proves to be maintained for the manipulated sounds. 
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E1.3 Sounds of Single Front Vowels Showing Non-Corresponding 
F2 and F3 


To do: (i) Find examples of sound pairs of the same intended front 
vowel, produced as voiced sounds in isolation at similar FO, for which 
F2 of the first sound is near or above F3 of the second sound. (ii) Per- 
form a listening test. 


Option: You may compare sounds produced by speakers of the same 
age and gender group as well as of different groups. You may also per- 
form resynthesis, and perform a related second listening test. You may 
also investigate the roles of the higher formants in bandpass filtering 
single formants. 


Thesis: You will find such examples of sound pairs equal in perceived 
vowel quality. 


Examples: See Section M7.1, Figures 8 to 10. 


E1.4 Sounds of Back Vowels Showing No Pronounced Spectral 
Peak <1.5kHz 


To do: (i) Find examples of sounds of back vowels, produced as voiced 
sounds in isolation, which show no pronounced spectral peak < 1.5 kHz 
apart from the fundamental (“flat” spectra, or spectra exhibiting con- 
tinuously decreasing amplitudes of the harmonics). (ii) Perform a lis- 
tening test. 


Thesis: You will find examples for which the score of vowel identifi- 
cation is high. Further, you may experience examples for which the 
calculation of F1-F2 depends on rather small amplitude variations of 
the first harmonics. 


Examples: See Section M7.2, Figures 11 and 12.


E1.5 Sounds of Front Vowels Showing No Pronounced Spectral
Peak > 2 kHz


To do: (i) Find examples of sounds of front vowels, produced as voiced 
sounds in isolation, which show no pronounced spectral peak > 2 kHz. 
(ii) Perform a listening test. 


Thesis: You will find examples for which the vowel identification score 
is high. 


Examples: See Section M7.2, Figures 13 and 14. 
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E2 Patterns of Relative Spectral Energy 
Maxima, Formant Patterns and Funda- 
mental Frequency 


E2.1 Sounds of Single Vowels Produced at Different F0
Exhibiting Different Spectral Peaks and Different Calculated
Formant Patterns: Part 1, Dependence of Formant
Patterns on F0


To do: (i) Select speakers with excellent vocal abilities. (ii) Investigate
all long vowels of the language in question. (iii) Let the speakers
produce single words (including word pairs forming minimal pairs), single
syllables (including logatomes) and isolated vowel sounds for their
entire range of F0 of possible vowel production. (iv) Perform a listening
test. (v) Only select sounds with a high identification score. (vi) Perform
spectral analysis and LPC analysis.


Options: You may need to train the speakers so that they indeed
maintain the perceived vowel while altering F0. You may select professional
singers, actresses and actors. You may give special attention to the
entertainment sector, including voice-over. You may vary vocal effort.
You may include resynthesis. You may also extract words or syllables
or vowel nuclei from existing recordings.


Thesis: (i) You will obtain unsystematic results, above all depending
on single speakers, F0 levels and vocal effort, frequency ranges of
spectral peaks and formants, vowel qualities and additional spectral
characteristics of the original sounds. (ii) However, for F0 > 200 Hz, the
spectral peaks and the calculated lower formants will shift with rising
F0 for a substantial part of your sample even if the perceived vowel
quality remains the same. (iii) Whether or not you experience a
systematic (and not speaker-related) impact of the syntactic or semantic
context of the vowel sounds is left open here.


Examples: See Section M8.1, Figures 1 to 5. 
E2.2 Sounds of Single Vowels Produced at Different F0 Exhibiting
Different Spectral Peaks and Different Calculated Formant
Patterns: Part 2, Vowel Intelligibility for Sounds at F0 > 500 Hz


To do: (i) Refer to the sounds of the previous experiment. (ii) Select the
sounds at F0 > 500 Hz.


Thesis: (i) You will obtain different results related to the abilities and
production styles or habits of the speakers. (ii) However, you will
observe possible vowel perception up to F0 corresponding to the upper
frequency limit of F1 for men and women as given in formant statistics.


Examples: See Section M8.1, Figures 2 and 3, and Section M8.2, Figures 
6 and 7; see also the pitch contours in Section M8.2, Figures 8 to 10. 


E2.3 Sounds of Single Vowels Produced at Different FO Exhibiting 
Different Spectral Peaks and Different Calculated Formant 
Patterns: Part 3, Resynthesising a Formant Pattern at 
Different FO 


To do: (i) Refer to the sounds of experiment E2.1. (ii) Select two sounds
of one vowel exhibiting very different F0 and different spectral peaks
or (lower) formants, respectively. (iii) Concatenate these two sounds
and insert a pause between them. If necessary, equalise loudness. (iv)
Perform resynthesis of the concatenated sound, applying three con- 
ditions for FO. Firstly, use FO of the original sounds; secondly, fix FO to 
the value of the original sound at lower FO; thirdly, fix FO to the value 
of the original sound at higher FO. (v) Perform a listening test including 
all sound pairs. 


Options: Instead of concatenating two sounds, a singer or speaker 
with high vocal ability may perform a glissando, and resynthesis is per- 
formed at original (altering) FO, fixed FO corresponding to the lowest, 
and fixed FO corresponding to the highest FO values of the original 
sound. However, during the production of the glissando, the vowel 
quality must be strictly maintained. 


Thesis: (i) You will obtain unsystematic results (see above). (ii) How- 
ever, you will find many cases for which the original sounds of a pair 
as well as the resynthesised sounds, for which the first condition men- 
tioned applies, are perceived as the same vowel, but the resynthesis 
applying the second and third condition produces a change in vowel 
perception between the two sounds of a pair. 


E2.4 Sounds of Single Back Vowels Produced at Different FO 
Exhibiting Inverse Spectral Peaks 


To do: Refer to experiment E2.1 but, in particular, consider sound pairs 
of a back vowel which differ in FO and exhibit an “inversion” of spectral 
peaks, that is, the first relative spectral energy maximum (correspond- 
ing to its F1) for the sound at higher FO is found at a frequency level of 
a relative spectral minimum for the sound at lower FO, in between the 
first and second spectral peak (in between the F1 and F2) of the latter. 
Consider also resynthesis and identification scores. 
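
The "inversion" condition described above can be expressed as a simple check on measured peak frequencies. The following Python sketch assumes that the peak and minimum frequencies have already been read off the spectra. The function name, the tolerance of 50 Hz and the example values are hypothetical and serve only as an illustration.

```python
def is_inverted_pair(low_f0_peak1, low_f0_minimum, low_f0_peak2,
                     high_f0_peak1, tolerance_hz=50.0):
    """Check the "inversion" condition for two sounds of one back vowel.

    All frequencies are in Hz. The first three values describe the sound
    at lower F0 (first peak, minimum in between, second peak). The fourth
    value is the first spectral peak of the sound at higher F0.
    """
    # The minimum must lie between the two peaks of the lower-F0 sound.
    minimum_between_peaks = low_f0_peak1 < low_f0_minimum < low_f0_peak2
    # The first peak of the higher-F0 sound must fall near that minimum.
    peak_at_minimum = abs(high_f0_peak1 - low_f0_minimum) <= tolerance_hz
    return minimum_between_peaks and peak_at_minimum


# Hypothetical measurements for two sound pairs.
print(is_inverted_pair(400.0, 650.0, 900.0, 660.0))
print(is_inverted_pair(400.0, 650.0, 900.0, 420.0))
```

The tolerance is a free parameter here; in practice it would have to be chosen with regard to the harmonic spacing of the sound at higher F0.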


Thesis: You will find many cases for which the sounds of such pairs are 
perceived as the same vowel. 


Examples: See Section M8.3, Figures 11 to 13. 


E2.5 Special Note Concerning Inconstant Numerical Relation- 
ship between Calculated FO and Formant Patterns 


To do: (i) Refer to sounds at very different F0, above all to sounds of
the vowels /e, o/. Include sounds produced with different vocal effort.
(ii) Perform a listening test. (iii) Select only sounds with a high iden- 
tification score. (iv) Calculate formant patterns for these sounds. (v) 
Perform resynthesis. (vi) Perform a listening test with the resynthesised 
sounds. 


Thesis: (i) You may observe sound pairs for which F1 or F1-F2 of the 
sound at lower FO is higher than F1 or F1-F2 at higher FO, thus seem- 
ingly indicating an “inverse” dependence of lower formants and FO. (ii) 
You may also note that resynthesis seems to confirm this observation. 
(iii) However, you will have to relate such observations to a limited 
frequency range of FO, differences in vocal effort may have a strong 
influence on formant estimation and you will have to consider method- 
ological aspects of LPC analysis. 


Examples: See Section M8.1, Figure 4 for an indication. 


E3 Formant Pattern Ambiguity


E3.1 Formant Pattern Ambiguity in Natural Vocalisations 


To do: (i) Select speakers from all three speaker groups with excellent 
vocal abilities. (ii) Let them produce isolated sounds of long vowels at 
very different FO. Vary vocal effort, for example medium, low and high 
vocal effort. Investigate a frequency range of F0 of 220 to 700 Hz for
children, 175 to 880 Hz for women and 110 to 523 Hz for men. Inves- 
tigate different FO step by step (you may refer to a musical scale). (iii) 
Perform spectral analysis and formant pattern analysis. With regard to 
the latter, you may perform the analysis also for FO>350Hz even if there 
is a lack of methodological substantiation. (iv) Perform a listening test. 
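
The step-by-step F0 levels mentioned in (ii) can be derived from a musical scale. The following Python sketch lists the tones of the C-major scale within a given F0 range. It assumes equal temperament with A4 = 440 Hz and MIDI note numbering, neither of which is specified in the text, and it allows a margin of 1 Hz because the range limits given above are rounded.

```python
def c_major_f0_steps(low_hz, high_hz, a4_hz=440.0):
    """Return C-major scale tones (in Hz) within [low_hz, high_hz].

    Assumes equal temperament; MIDI note 69 corresponds to A4.
    """
    major_pitch_classes = {0, 2, 4, 5, 7, 9, 11}  # C D E F G A B
    steps = []
    for midi in range(128):
        if midi % 12 not in major_pitch_classes:
            continue
        frequency = a4_hz * 2 ** ((midi - 69) / 12)
        # Margin of 1 Hz because the limits in the text are rounded values.
        if low_hz - 1 <= frequency <= high_hz + 1:
            steps.append(round(frequency, 1))
    return steps


men = c_major_f0_steps(110, 523)
print(len(men), men[0], men[-1])
```

For the range given for men, this yields the tones from A2 (110 Hz) up to C5 (c. 523 Hz).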


Thesis: (i) You will find unsystematic results (see above). (ii) However,
comparing the vowel-related patterns of spectral peaks and of
formants of the sounds of a single speaker, you will find many exam- 
ples of similar patterns for sounds at different FO and two different 
perceived vowel qualities. You may even encounter examples of such 
patterns for three vowels. (iii) The same holds true in an extended way 
for a corresponding comparison of the sounds of different speakers. 
(iv) You will not be able, in general terms, to directly relate such a pat- 
tern ambiguity to differences in speaker group or vocal effort. 


Examples: See Section M9.1, Figures 1 to 21. 


E3.2 Formant Pattern Ambiguity in Model Synthesis 


To do: (i) Refer to the sounds in experiment E3.1. (ii) Select sounds 
of different vowels for which—apart from differences in FO and the 
frequency distance of the harmonics—a direct comparison of the vow- 
el-related spectral region as well as the corresponding spectral peaks 
and formant patterns can be considered similar, according to prevail- 
ing consideration in phonetics. (iii) Use the related formant patterns 
(including formant bandwidths) as models for vowel synthesis. (iv) Per- 
form vowel synthesis for the entire range of the FO you have investigat- 
ed in the previous experiment. (v) Perform a listening test. 


Thesis: You will observe that, for selected formant patterns of natural 
vocalisations that prove to be ambiguous in vowel representation, the 
alteration of FO in such a model synthesis generally produces a clear 
and sometimes very pronounced change in perceived vowel quality. 


E4 Patterns of Relative Spectral Energy
Maxima, Formant Patterns and Age-
and Gender-Related Vocal-Tract Sizes


E4.1 Comparison of Vowel-Specific Spectral Characteristics 
of Children, Women and Men Related to Different and 
Similar FO of Vocalisations: Part 1, Natural Vocalisations 


To do: (i) Select a child, a woman and a man with excellent vocal abil- 
ities. (ii) Let them produce isolated sounds of long vowels at different 
FO according to the C-major scale, for example starting from 220Hz 
for the child, from 175Hz for the woman and from 131 Hz for the man. 
Investigate a range of FO up to 523Hz. Ensure that the sounds corre- 
spond with each other perceptually, not only in vowel quality but also 
in “vowel-colour” variant, which makes for the greatest possible corre- 
spondence as regards perception (exclusion of age- and gender-relat- 
ed “dialects”). (iii) Perform a listening test. (iv) Perform spectral analy- 
sis and compare the spectra and the spectral peaks of the sounds of a 
single vowel. (v) In parallel, perform formant analysis and compare the
formant patterns of the sounds of a single vowel. 


Option: You may proceed in a similar way with several speakers from 
the three speaker groups so as to re-examine formant statistics. However,
you will not be able to control the correspondences of the vowel
qualities as precisely as in an investigation of the utterances of three 
single speakers. 


Thesis: (i) With regard to the spectral characteristics in general and 
the spectral peaks in particular, you will find the expected differences 
which are in line with the numbers given in formant statistics for cita- 
tion-form words, if the FO of the sounds also concurs with the FO of 
the statistics in question, that is, c. 262Hz for the child, c. 220Hz for 
the woman and c. 131 Hz for the man (levels given according to the 
C-major scale). (ii) However, you will observe that spectral differenc- 
es<1.5kHz decrease or disappear if the speakers vocalise at a similar 
FO. (iii) You will even observe cases of “inversions” of expected age- 
and gender-related spectral differences in terms of higher spectral 
peaks < 1.5kHz for the sounds of the two adults than for the sounds 
of the child, if the FO of the former are also higher than of the latter. 
The same will hold true for the comparison of sounds of the man with 
sounds of the woman at correspondingly different FO. (iv) With regard 
to calculated formant patterns, you will observe similar behaviour. However,
methodological problems of analysis will interfere. (v) With regard
to formant statistics, you will not be able to resolve the methodologi- 
cal problem of formant pattern analysis at FO>350Hz. Moreover, you 
will have to consider possible age- and gender-related vowel colouring 
(age- and gender-related “dialects”). However, for sounds > 220 Hz, 
you will no longer find a clear indication of generalised age- and gen- 
der-related formant patterns < 1.5 kHz, if the FO of the sounds corre- 
spond. 


Examples: See Section M10.1, Figures 1 to 10. 


E4.2 Comparison of Vowel-Specific Spectral Characteristics of 
Children, Women and Men Related to Different and Similar 
FO of Vocalisations: Part 2, Resynthesis 


To do: (i) Select the sounds of the three single speakers of the previous 
experiment. (ii) Resynthesise them on the basis of formant analysis 
but, for each single formant pattern of a single vocalisation, perform 
resynthesis for all FO levels on which the speaker produced vowel 
sounds. (iii) Perform a listening test. 


Thesis: (i) If resynthesis is performed applying FO and formant pat- 
terns of the original sounds, in general, the perceived vowel quality 
will not change. (ii) If only the formant patterns correspond to the orig- 
inal sounds but FO is varied according to the FO-range of the natural 
sounds, you will obtain unsystematic results (see above). However, for 
some of the vowels investigated and for FO of the sounds > 200 Hz, for 
all three speakers, you will find many examples of sounds for which the 
perceived vowel quality changes with changing FO. 


E5 Patterns of Relative Spectral Energy
Maxima, Formant Patterns and Phonation
Types


E5.1 Whispered Sounds Compared with Voiced Sounds 
at Different FO in Utterances of a Single Speaker 


To do: (i) Select a speaker with good vocal abilities. (ii) Let the speaker 
produce isolated whispered sounds of the long vowels of his language. 
(iii) Then, let the speaker produce voiced sounds of these vowels at 
different levels of F0. (You may refer to a musical scale.) Investigate a
range of F0 up to at least 523 Hz. (You may refer to utterances of
a woman.) Pay attention to the close correspondence of the produced 
vowel qualities and vowel colours. (iv) Perform spectral analysis and 
formant analysis. (v) Perform resynthesis, according to the following 
conditions: for a given formant pattern of a single sound produced, 
as source characteristic, apply all FO investigated as well as noise. (vi) 
Perform a listening test. 


Thesis: (i) You will find unsystematic results (see above). (ii) When
comparing whispered sounds with voiced sounds at lower FO, in many 
cases, you will find indications of higher spectral peaks <1.5kHz and 
higher frequencies of calculated F1 and F2 for the former than the lat- 
ter, as is indicated in formant statistics for citation-form words. (iii) 
However, you will also find many cases in which such differences de- 
crease or even disappear if the FO of the voiced vowel sound is raised. 
(iv) In parallel, often, no change in vowel perception will be found for a 
resynthesis using formant patterns of whispered sounds but higher FO 
of voiced sounds. (v) In parallel, as mentioned above, a change in vowel
perception will often be found for resynthesising formant patterns of 
voiced sounds with regard to all FO investigated. 


E5.2 Whispered Sounds Compared with Voiced Sounds
at Different F0 in Utterances of Speakers of Different
Speaker Groups


To do: Redo the previous experiment for three speakers, a child, a wom- 
an and a man. (You may refer to the three speakers and the sounds of 
experiment E4.1.) 


Thesis: In addition to the results predicted for experiment E4.1, you 
can question the so-called speaker group differences. Above all, for a 
given vowel, you may find correspondences of formant patterns of a 
voiced sound of a child when compared to a whispered sound of an 
adult, and vice versa. 


E5.3 Sounds of Back Vowels Showing Three Spectral 
Peaks<1.5kHz 


To do: (i) Search for examples of sounds of back vowels, produced 
as whispered sounds in isolation, which show three spectral peaks 
<1.5kHz. Also search for correspondingly produced examples that 
only show two peaks < 1.5 kHz. (ii) Perform a listening test. 


Thesis: You will find examples of the first kind for which the identifica- 
tion score is as high as for the examples of the second kind. 


E5.4 Sounds of Front Vowels Showing Two Spectral 
Peaks<1.5kHz 


To do: (i) Search for examples of sounds of front vowels, produced as 
whispered sounds in isolation, which show two spectral peaks < 1.5 kHz. 
Also search for correspondingly produced examples that show only 
one peak<1.5kHz. (ii) Perform a listening test. 


Thesis: You will find examples of the first kind for which the identifica- 
tion score is as high as for the examples of the second kind. 


E6 Patterns of Relative Spectral Energy
Maxima, Formant Patterns and Vowel
Imitation by Birds


E6.1 Direct Comparisons of Selected Sounds of Humans 
and Birds 


To do: (i) Create a sample of imitated words produced by birds, for 
example common hill myna birds. (ii) Select the best examples with 
regard to intelligibility. (iii) Isolate the sound nuclei corresponding to 
a vowel. (iv) Perform a listening test. (v) Select the sounds with a high 
score of consistent vowel perception. (vi) Let a woman and a man in 
turn imitate the “words” of the birds at the corresponding FO, and iso- 
late the vowel sound nuclei. (vii) Perform a second listening test for 
all the sounds compared. (viii) Perform spectral analysis and formant 
pattern analysis. Concerning the sounds of the birds, even if method- 
ologically unsubstantiated, you may apply both standard parameter 
settings for females and for males. 


Thesis: (i) You will be able to observe examples in which a bird can 
produce a sound with a formant pattern F1-F2-F3 that corresponds 
to the formant pattern of a woman or a man. (ii) You will also be able 
to observe examples for which the sound of a bird does exhibit only a 
part of the formant patterns produced by the woman or man, yet vowel 
perception is not impaired. 


E6.2 Resynthesis Relating to “Anomalous” Formant Patterns 
of Sounds of Birds 


To do: (i) Select the sounds of the birds of the previous experiment with 
intelligible vowel quality but only partial correspondence of the formant 
patterns compared with the sounds of the man or the woman. (ii) Per- 
form resynthesis. (iii) Perform a listening test. 


Thesis: You will be able to resynthesise these sounds related to “anom- 
alous” formant patterns with no substantial change in perceived vow- 
el quality compared with the natural sounds. 




E7 Anomalous Vowel Spectra 


E7.1 Spectra with Increasing Number of Harmonics Equal 
in Amplitude (“Flat” Vowel Spectra) 


To do: (i) Perform vowel synthesis using a harmonic synthesiser, that
is, create harmonic spectra, perform an inverse Fourier transform and repeat
the periods obtained over time for a certain duration, for example
1 s. (ii) Investigate sounds at F0 of 110 Hz and 220 Hz separately. (iii)
Start a synthesis with only the first harmonic or fundamental at a given 
FO. Then continue to add, step by step, harmonics 2, 3, 4, etc. equal in 
amplitude to the fundamental. (iv) Perform a listening test. 
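
The synthesis procedure can be sketched as follows. This minimal Python sketch sums cosine components of equal amplitude directly in the time domain, which is equivalent to creating one period from a harmonic spectrum and repeating it. The function name, the sample rate and the peak normalisation are assumptions made for illustration; the text does not prescribe a particular implementation.

```python
import math


def synthesise_flat_spectrum(f0_hz, n_harmonics, duration_s=1.0,
                             sample_rate=44100):
    """Sum n_harmonics cosine components of equal amplitude."""
    if n_harmonics * f0_hz >= sample_rate / 2:
        raise ValueError("highest harmonic must stay below half the sample rate")
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(math.cos(2 * math.pi * k * f0_hz * t)
                    for k in range(1, n_harmonics + 1))
        # Divide by the number of harmonics so that the peak value is 1.
        samples.append(value / n_harmonics)
    return samples


# Sound series at F0 = 110 Hz: fundamental only, then harmonics 1-2, 1-3, ...
series = [synthesise_flat_spectrum(110.0, n, duration_s=0.05)
          for n in range(1, 6)]
print(len(series), len(series[0]))
```

Each element of the series can then be written to a sound file and presented in the listening test.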


Option: You may also investigate FO other than the two frequency lev- 
els mentioned. 


Thesis: You will find some sounds in the sound series created for which 
the listening test gives a vowel identification of one of the vowels /u/, 
/o/, /o/ and /a/. Possibly, /e/ is also perceived.


Extension: You may extend the investigation to front vowels concerning 
“flat” spectral parts >c. 2 kHz. (Try also “flat” spectral parts >c. 1.5 kHz.)
You may then start with a series of lower harmonics as found in natu- 
ral vocalisations and add, step by step, harmonics equal in amplitude 
from c. 2 kHz (or from c. 1.5 kHz) upwards. 


E7.2 Spectra with Increasing Number of Harmonic Pairs 
Showing Equal Amplitude Differences (“Ridged” Parts 
of Vowel Spectra) 


To do: Apply the same procedure as described in the previous experiment 
but add, step by step, harmonics with periodically increasing and decreasing
amplitudes; for example L2 (level of second amplitude) < L1
(level of first amplitude), L3 = L1, L4 = L2, and so on; or vice versa. 
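
The alternating level pattern can be sketched as a list of linear amplitude weights, one per harmonic. In the following Python sketch, the level difference is given in dB and converted to a linear factor. The function name and the example difference of 6 dB are assumptions made for illustration.

```python
def ridged_amplitudes(n_harmonics, level_difference_db, start_high=True):
    """Alternating harmonic amplitudes: L3 = L1, L4 = L2, and so on.

    With start_high=True the odd harmonics have amplitude 1.0 and the
    even harmonics are lowered by level_difference_db (L2 < L1).
    With start_high=False the pattern is reversed ("vice versa").
    """
    low = 10 ** (-abs(level_difference_db) / 20)
    amplitudes = []
    for k in range(1, n_harmonics + 1):
        is_odd = k % 2 == 1
        amplitudes.append(1.0 if is_odd == start_high else low)
    return amplitudes


print(ridged_amplitudes(4, 6.0))
print(ridged_amplitudes(4, 6.0, start_high=False))
```

These weights would replace the equal amplitudes used for the "flat" spectra, with one weight applied to each harmonic component.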


Thesis: (i) You will obtain results depending on the extent of the differ- 
ence in the harmonic level you have set. (ii) However, within a limited 
range of such a difference, the listening test will provide similar results 
to those predicted for the previous experiment. 


Extension: You may again extend the investigation to front vowels con- 
cerning “ridged” spectral parts > 2 kHz. 


E8 Aspects of Method


E8.1 Formant Pattern Estimation Related to Non-Standard 
Parameters 


To do: (i) Refer to a large sample of isolated vowel sounds. (ii) Perform 
LPC analysis applying standard parameters. (iii) Select the sounds for 
which the calculated formant patterns clearly do not correspond to 
what is expected if referred to formant statistics. However, do not try to 
include very high FO. (iv) Perform a listening test. (v) Select only sounds 
with a high score of identification. (vi) Perform LPC analysis again but 
alter the parameters. 


Option: You may also perform resynthesis. 


Thesis: (i) You will find various examples for which LPC analysis based 
on non-standard parameters as given in the literature—above all based 
on a non-standard maximum number of formants for a given frequency
range, which is usually related to age and gender of the speaker— 
provides “better” (that is, more “expected”) results than LPC analysis 
based on standard parameters. (ii) However, you will not be able to 
relate this finding to a general production characteristic for all vowel 
sounds produced by a single speaker. 


E8.2 Formant Pattern Estimation at FO>350Hz 


To do: (i) Select isolated vowel sounds produced at FO >350 Hz. (ii) 
Perform a listening test. (iii) Select sounds with a high identification 
score. (iv) Perform LPC analysis using standard parameters. (v) On the 
basis of the corresponding results, perform resynthesis. (vi) Perform a 
listening test related to the resynthesised sounds. 
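
One commonly cited reason why formant estimation becomes problematic at high F0 is that the harmonics sample the spectral envelope only sparsely. The following Python sketch lists the harmonics that fall at or below 1.5 kHz for a given F0. The function name and the choice of example F0 values are illustrative assumptions.

```python
def harmonics_up_to(f0_hz, limit_hz=1500.0):
    """Frequencies (Hz) of all harmonics of f0_hz at or below limit_hz."""
    count = int(limit_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


for f0 in (131, 350, 523, 880):
    harmonics = harmonics_up_to(f0)
    print(f0, len(harmonics), harmonics)
```

At F0 = 131 Hz, eleven harmonics lie in this range, whereas at F0 = 523 Hz only two remain, so that the position of an envelope peak between them can hardly be determined from the spectrum alone.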


Thesis: (i) You will find variable results. (ii) However, you will find many 
examples for which the natural and the resynthesised sounds are perceived
as the same vowel, although the LPC analysis is not methodologically
substantiated and the calculated formant pattern may differ
strongly from values given in formant statistics. 


E8.3 Resynthesis of Sounds at Varying F0 and Subsequent
Formant Pattern Estimation


To do: (i) Select isolated natural vowel sounds. (ii) Perform a listen- 
ing test. (iii) Select only sounds with a high identification score. (iv) 
Perform LPC analysis using standard parameters. (v) On the basis of 
the corresponding results, perform resynthesis for two conditions; first, 
use FO of the natural vocalisation; second, use a very different FO level. 
(vi) Perform a listening test again and select again only sounds with a 
high identification score. (vii) Perform LPC analysis for both types of the
resynthesised sounds. 


Thesis: (i) You will find variable results. (ii) However, you will find many 
examples for which the calculated formant pattern of the resynthesised 
sounds differs substantially from the original formant pattern, if the FO 
of the resynthesised and the natural sounds also differ substantially.


Prevailing Theory
Illustration 
Figure 1. Illustration of prevailing theory. 


Figure 2. Examples of sounds of different vowels produced 
in isolation by adult male speakers at fundamental frequencies 
of 120-140 Hz. 


Prevailing Empirical References 
General References 


Figure 1. Illustration of the distribution of the first two formants 
for American English vowels (Peterson & Barney, 1952). 


Unsystematic Correspondence between Vowels, Patterns 
of Relative Spectral Energy Maxima and Formant Patterns 


Inconstant Number of Vowel-Specific Relative Spectral 
Energy Maxima and Incongruence of Vowel-Specific 
Formant Patterns 


Figures 1 to 3. Sounds of /a-a, o, u/ which exhibit only one
relative spectral energy maximum within their vowel-specific 
frequency range<c. 1.5 kHz. 

Figure 1. Sounds produced by children. 

Figure 2. Sounds produced by women. 

Figure 3. Sounds produced by men. 


Figures 4 to 6. Sounds of /a-a, o, u/ which exhibit two relative
spectral energy maxima within their vowel-specific frequency 
range<c. 1.5kHz. 

Figure 4. Sounds produced by children. 

Figure 5. Sounds produced by women. 

Figure 6. Sounds produced by men. 


Figure 7. Direct comparisons of sounds of back vowels 
with one or two relative spectral energy maxima<c. 1.5 kHz. 


Figures 8 and 9. Sound pairs of /i/ and of /e/, each pair
produced by speakers of one and the same age- and gender-
related speaker group, with small differences in F0 and F1 but
substantial differences in the higher vowel-related spectral
range.

Figure 8. Sound pairs of /i/.

Figure 9. Sound pairs of /e/.


Figure 10. A sound pair of /i/ and a corresponding pair of /e/,
small differences in F0 and F1 but very pronounced differences
in the higher vowel-related spectral ranges.


Figures 11 and 12. Sounds of /a-a, o/, produced by children,
spectral portions <c. 1.5 kHz lacking a clearly determinable
vowel-related peak.

Figure 11. Sounds of /a-a/. 

Figure 12. Sounds of /o/. 


Figures 13 and 14. Sounds of /i, e/, produced by children,
women and men, which exhibit “flat” or “sloping” spectral
portions in the frequency range of 1.5-5 kHz lacking a clearly
determinable pattern of vowel-related peaks. 

Figure 13. Sounds of /i/. 

Figure 14. Sounds of /e/. 


Dependence of Vowel-Specific, Relative Spectral Energy
Maxima and Lower Formants < 1.5 kHz on Fundamental


Figures 1 and 2. Sounds of /o, o, e/ and of /u, y, i/, produced
by single speakers at different F0, which indicate a shift of the
lowest spectral peak as well as of calculated F1 with rising F0.

Figure 1. Sounds of /o, o, e/ produced by a woman (/o/),
a man (/o/) and a child (/e/).

Figure 2. Sounds of /u, y, i/ produced by a woman (/u/),
a child (/y/) and another woman (/i/).


Figure 3. Sounds of /a-a/, produced at different F0 by a child,
for which there is no clear indication of a relation between F0
and the lower spectral envelope.


Figure 4. Sounds of /o/, produced at different FO by a woman, 
for which only a very weak indication of a relation between FO 
and the lower spectrum is manifest. 


Figure 5. Three sound pairs of /o, u/ and three sound pairs of 
/e, i/, produced by children, women and men, exhibiting 

a higher first spectral peak frequency for /u/ than for /o/, and 
for /i/ than for /e/, respectively. 


Figure 6. Five intelligible sounds of /y, e, a, ɛ, o/ produced
by children and women at F0 in the range of 700-800 Hz.


Figure 7. Three intelligible sounds of the corner vowels /i, a, u/ 
produced by women at FO of c. 850 Hz. 


Figure 8. Pitch contours of speech extracts produced
by untrained speakers, journalists, TV hosts and actresses
talking on TV (not acting), as can be experienced in everyday life.


Figures 9 and 10. Pitch contours of extracts of speech produced
by actresses and actors while performing (film, comic, voice- 
over, dubbing). 

Figure 9. Pitch contours of actresses. 

Figure 10. Pitch contours of actors. 


and Minima and “Inverse” Formant Patterns in Sounds
of Individual Vowels


Figures 11 to 13. Sounds of /a-a, o, u/, produced at different
F0 by children, women and men, which exhibit “inverse”
relative spectral maxima and minima in terms of “inverse” 
spectral envelope curves < 1.5 kHz. 

Figure 11. Sounds of /a-a/. 

Figure 12. Sounds of /o/. 

Figure 13. Sounds of /u/. 


Figures 1 to 3. Comparisons of sounds of back vowels and of
/a-a/ produced by different speakers, and related model patterns
of spectral peaks and/or of calculated formant frequencies.

Figure 1. Sounds of /a-a, o, u/; related model pattern

Figure 2. Sounds of /a-a, o, u/; related model pattern

Figure 3. Sounds of /a-a, o, u/; related model pattern


Figures 4 to 8. Comparisons of sounds of back vowels and of

Figure 4. Three comparisons of sounds of /a-a, o, u/ produced
by a man (sounds sung by a tenor); related model pattern
= 600-1200 Hz for the first comparison; similar spectral peaks

Figure 6. Sounds of /a-a/ and of /u/, produced by a woman,
which exhibit comparable spectral envelopes < 1.5 kHz.

Figure 7. Sounds of /ɔ, o, u/ produced by a woman; related
model = one clear peak at c. 550 Hz.

Figure 8. Two comparisons of sounds of /o, u/ produced by
two children (age 12 and 6); related model patterns = one clear


Figures 9 to 18. Comparisons of sounds of front vowels produced
by different speakers, and related model patterns of

Figure 10. Sounds of /ø, e, y, i/; related model pattern

Figure 12. Sounds of /e, e, i/; related model pattern

Figure 24. A sound pair of /e, e/ produced by a single male
speaker; model pattern of spectral peaks and/or of calculated
formant frequencies = 350-1700 Hz.




M10 Lack of Correspondence between Patterns of Relative 
Spectral Energy Maxima or Formant Patterns and Age- 
and Gender-Related Speaker Groups or Vocal-Tract Sizes 


M10.1 Similar Patterns of Relative Spectral Maxima and Similar 
Formant Patterns < 1.5 kHz for Different Age- 
and Gender-Related Speaker Groups or Vocal-Tract Sizes 


Figures 1 to 6. Comparisons of sounds produced by single
children, women and men at comparable levels of F0.

Figure 1. Sounds of /o/.

Figure 2. Sounds of /e/.

Figure 3. Sounds of /u/.

Figure 4. Sounds of /i/.

Figure 5. Sounds of /a/.

Figure 6. Sounds of /o/, including vocalisations
of a professional opera singer (baritone).


Figures 7 to 10. “Inverted” age- or size-related differences in
vowel-related lower spectral peak(s) and calculated F1 (and F2)
for sounds produced by single children and men.

Figure 7. Sounds of /o/.

Figure 8. Sounds of /e/.

Figure 9. Sounds of /u/.

Figure 10. Sounds of /i/.


M10.A Addition: Vowel Imitations by Birds 


Figures 11 to 16. Vowel sounds in word context imitated by
mynah birds.

Figure 11. Sounds of /i/.

Figure 12. Sounds of /e/.

Figure 13. Sounds of /a-a/.

Figure 14. Sounds of /o/.

Figure 15. Sounds of /u/.




List of Tables 


2 Prevailing Empirical References 
2.1 General References 


Table 1. Formant statistics for American English vowels (Peterson
& Barney, 1952).


Table 2. Formant statistics for American English vowels (Hillenbrand
et al., 1995).


Table 3. Formant statistics for Swedish vowels (Fant, 1959).


2.2 Empirical Reference for Standard German


Table 4. Formant statistics for Standard German vowels
(Pätzold and Simpson, 1997).




References 


Boersma, P., & Weenink, D. (2015). Praat: doing phonetics by comput- 
er [Computer program]. Version 5.4.08. Retrieved March 30, 2015, 
from http://www.praat.org. 


Delattre, P. (1980). Vowel color and voice quality. In J. Large (Ed.), Con- 
tributions of Voice Research to Singing (pp. 373-384). Houston, TX:
College Hill Press. (Reprinted from The Bulletin of the National As- 
sociation of Teachers of Singing, 1958, XV, 4-7.) 


Diehl, R. L., Lindblom, B., Hoemeke, K. A., & Fahey, R. P. (1996). On ex- 
plaining certain male-female differences in the phonetic realization 
of vowel categories. Journal of Phonetics, 24(2), 187-208. 


Fant, G. (1959). Acoustic analysis and synthesis of speech with appli- 
cations to Swedish. Ericsson Technics, 1, 1-106. 


Fant, G. (1960). Acoustic theory of speech production. The Hague: 
Mouton. 


Fant, G., Carlson, R., & Granström, B. (1974). The [e]-[ø] ambiguity.
In Speech Communication Seminar (pp. 117-121). 


Fant, G., Henningsson, G., & Stålhammar, U. (1969). Formant frequencies
of Swedish vowels. Speech Transmission Laboratory Quarterly
Progress and Status Report, 10(4), 26-31. 


Friedrichs, D., Maurer, D., & Dellwo, V. (2015). The phonological func- 
tion of vowels is maintained at fundamental frequencies up to 880 
Hz. The Journal of the Acoustical Society of America, 138(1),
EL36-EL42.


Friedrichs, D., Maurer, D., Suter, H., & Dellwo, V. (2015). Vowel identifi- 
cation at high fundamental frequencies in minimal pairs. In Proceed- 
ings of the 18th International Congress of Phonetic Sciences (no. 
0434, pp. 1-4). 


Fulop, S. A. (2011). Speech spectrum analysis. Berlin: Springer Science 
& Business Media. 


Gelfer, M. P., & Bennett, Q. E. (2013). Speaking fundamental frequency
and vowel formant frequencies: Effects on perception of gender.
Journal of Voice, 27(5), 556-566. 




von Helmholtz, H. L. F. (1954). On the sensations of tone. New York, 
NY: Dover. (Republication of the 2nd edition of the Ellis translation of 
Die Lehre von den Tonempfindungen, Longman & Co., 1885.) 


Hillenbrand, J. (n.d.). The physics of sound. Retrieved October 1, 2015, 
from http://homepages.wmich.edu/~hillenbr/206/ac.pdf


Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acous- 
tic characteristics of American English vowels. The Journal of the 
Acoustical Society of America, 97(5), 3099-3111. 


Hollien, H., Mendes-Schwartz, A. P., & Nielsen, K. (2000). Perceptual
confusions of high-pitched sung vowels. Journal of Voice, 14(2),
287-298. 


Howie, J., & Delattre, P. (1962). An experimental study of the effect
of pitch on the intelligibility of vowels. The National Association of 
Teachers of Singing Bulletin, 18(4), 6-9. 


Iivonen, A. (1986). A set of German stressed monophthongs analyzed
by RTA, FFT, and LPC. In R. Channon & L. Shockey (Eds.), In honour 
of Ilse Lehiste (pp. 125-138). Dordrecht: Foris. 


Iivonen, A. (1970). Experimente zur Erklärung der spektralen Variation
deutscher Phonemrealisationen (Commentationes Humanarum Lit- 
terarum, vol. 45). Helsinki: Societas Scientiarum Fennica. 


Joliveau, E., Smith, J., & Wolfe, J. (2004). Vocal tract resonances in 
singing: The soprano voice. The Journal of the Acoustical Society of 
America, 116(4), 2434-2439. 


Jørgensen, H. P. (1969). Die gespannten und ungespannten Vokale in 
der norddeutschen Hochsprache mit einer spezifischen Untersu- 
chung der Struktur ihrer Formantfrequenzen. Phonetica, 19, 217-245. 


Kent, R. D., & Read, C. (2002). The acoustic analysis of speech (2nd 
ed.). Clifton Park, NY: Delmar, Cengage Learning. 


Ladefoged, P. (1996). Elements of acoustic phonetics (2nd ed.). Chicago: 
The University of Chicago Press. 


Ladefoged, P. (2003). Phonetic data analysis: An introduction to field- 
work and instrumental techniques. Malden, MA: Wiley-Blackwell. 


Maurer, D. (2013). Akustik des Vokals – Präliminarien. subTexte 08, A.
Rey (Ed.). Zurich: Institute for the Performing Arts and Film, Zurich 
University of the Arts. 


Maurer, D. (n.d.). Acoustic characteristics of voice in music and straight 
theatre – towards a systematic empirical foundation. Project description.
Retrieved October 1, 2015, from
http://www.phones-and-phonemes.org/project-1.html. 


Maurer, D., Cook, N., Landis, T., & d’Heureuse, C. (1991). Are measured
differences between the formants of men, women and children due
to F0 differences? Journal of the International Phonetic Association,
21(2), 66-79. 


Maurer, D., & Landis, T. (1995). F0-dependence, number alteration, and
non-systematic behaviour of the formants in German vowels. Inter- 
national Journal of Neuroscience, 83(1-2), 25-44. 


Maurer, D., & Landis, T. (1996). Intelligibility and spectral differences in 
high-pitched vowels. Folia Phoniatrica et Logopaedica, 48(1), 1-10. 


Maurer, D., & Landis, T. (2000). Formant pattern ambiguity of vowel 
sounds. International Journal of Neuroscience, 100(1-4), 39-76. 


Maurer, D., Landis, T., & d’Heureuse, C. (1991). Formant movement and 
formant number alteration with rising F0 in real vocalisations of the
German vowels [u:], [o:] and [a:]. International Journal of Neurosci- 
ence, 57(1-2), 25-38. 


Maurer, D., Mok, P., Friedrichs, D., & Dellwo, V. (2014). Intelligibility of
high-pitched vowel sounds in the singing and speaking of a female 
Cantonese Opera singer. In Proceedings of the 15th Conference of 
the International Speech Communication Association, Interspeech 
2014 (pp. 2132-2133). (For an extended version including additional 
material, see the related internet presentation online at 
http://is2014.phones-and-phonemes.org, retrieved October 1, 2015.) 


Maurer, D., Suter, H., Friedrichs, D., & Dellwo, V. (2015). Acoustic charac- 
teristics of voice in music and straight theatre: topics, conceptions, 
questions. In A. Leemann, M.-J. Kolly, S. Schmid, & V. Dellwo (Eds.),
Trends in Phonetics and Phonology. Studies from German-speaking 
Europe (pp. 256-265). Bern/Frankfurt: Peter Lang. 


Moore, G. D. (2006). The physics and psychophysics of music (course 
page for Physics 224, lecture 28, p. 11). Retrieved November 1, 2015, 
from 
http://www.physics.mcgill.ca/~guymoore/ph224/notes/lecture28.pdf. 


van Nierop, D. J. P. J., Pols, L. C. W., & Plomp, R. (1973). Frequency 
analysis of Dutch vowels from 25 female speakers. Acta Acustica 
united with Acustica, 29(2), 110-118. 


Pätzold, M., & Simpson, A. (1997). Acoustic analysis of German vowels
in the Kiel Corpus of Read Speech. Arbeitsberichte des Instituts
für Phonetik und Digitale Sprachverarbeitung der Universität Kiel
(AIPUK), 32, 215-247.


Peterson, G. E., & Barney, H. L. (1952). Control methods used in a 
study of the vowels. The Journal of the Acoustical Society of Amer- 
ica, 24(2), 175-184. 


Pickett, J. M. (1999). The acoustics of speech communication: funda- 
mentals, speech perception theory, and technology. Boston, MA: 
Allyn & Bacon. 


Pols, L. C. W., Tromp, H. R. C., & Plomp, R. (1973). Frequency analysis 
of Dutch vowels from 50 male speakers. The Journal of the Acousti- 
cal Society of America, 53(4), 1093-1101. 


Potter, R. K., & Steinberg, J. C. (1950). Toward the specification of speech. 
The Journal of the Acoustical Society of America, 22(6), 807-820. 


Ramers, K. H. (1988). Vokalquantität und -qualität im Deutschen.
Linguistische Arbeiten 213. Tübingen: Niemeyer.


Rausch, A. (1972). Untersuchungen zur Vokalartikulation im Deutschen. 
In H. Kelz & A. Rausch (Eds.), Beiträge zur Phonetik (IPK-Forschungsberichte,
vol. 30, pp. 35-82). Hamburg: Buske.


Schroeder, M. R., & Strube, H. W. (1986). Flat-spectrum speech. The
Journal of the Acoustical Society of America, 79(5), 1580-1583. 


Sharifzadeh, H. R., McLoughlin, I. V., & Russell, M. J. (2012). A compre- 
hensive vowel space for whispered speech. Journal of Voice, 26(2), 
e49-56. 


Sundberg, J. (1978). Synthesis of singing. Swedish Journal of Musicol- 
ogy, 60(1), 107-112. 


Sundberg, J. (1987). The Science of the Singing Voice. DeKalb, Ill.:
Northern Illinois University Press.


Sundberg, J. (2013). Perception of singing. In D. Deutsch (Ed.), The
psychology of music (3rd ed., pp. 69-105). San Diego, CA: Elsevier.


Swerdlin, Y., Smith, J., & Wolfe, J. (2010). The effect of whisper and 
creak vocal mechanisms on vocal tract resonances. The Journal of 
the Acoustical Society of America, 127(4), 2590-2598. 


Trask, R. L. (1996). A dictionary of phonetics and phonology. New York, 
NY: Routledge. 


Titze, I. R., Baken, R. J., Bozeman, K. W., Granqvist, S., Henrich, N.,
Herbst, C. T., ... & Wolfe, J. (2015). Toward a consensus on symbol- 
ic notation of harmonics, resonances, and formants in vocalization. 
The Journal of the Acoustical Society of America, 137(5), 3005-3007. 


Traunmüller, H. (n.d.). The role of F0 in vowel perception. Retrieved
November 1, 2015, from http://www2.ling.su.se/staff/hartmut/i.htm.


Traunmüller, H., & Eriksson, A. (1997). A method of measuring formant
frequencies at high fundamental frequencies. In Proceedings of Euro- 
speech (Vol. 97, No. 1, pp. 477-480). 


Traunmüller, H., & Eriksson, A. (2000). Acoustic effects of variation in
vocal effort by men, women, and children. The Journal of the Acous- 
tical Society of America, 107(6), 3438-3451. 


Wängler, H.-H. (1981). Atlas deutscher Sprachlaute. Berlin:
Akademie-Verlag.


Wolfe, J. (n.d.). Formant: what is a formant? Retrieved November 1, 
2015, from http://www.phys.unsw.edu.au/jw/formant.html. 


Wolfe, J., Garnier, M., & Smith, J. (2009). Vocal tract resonances in
speech, singing, and playing musical instruments. Human Frontier 
Science Program Journal, 3(1), 6-23. 


Wood, S. (1989). The precision of formant frequency measurement from 
spectrograms and by linear prediction. Speech Transmission Labo- 
ratory Quarterly Progress and Status Report, 30(1), 91-93. 


Zee, E. (2003). Frequency analysis of the vowels in Cantonese from 50 
male and 50 female speakers. In Proceedings of the 15th Interna- 
tional Congress of Phonetic Sciences (pp. 1117-1120). 




